Python correct encoding of Website (Beautiful Soup)

python encoding utf-8 beautifulsoup mojibake

22,600

Solution 1

You are making two mistakes; you are mis-handling encoding, and you are treating a result list as something that can safely be converted to a string without loss of information.

First of all, don't use response.text! It is not BeautifulSoup at fault here, you are re-encoding a Mojibake. The requests library will default to Latin-1 encoding for text/* content types when the server doesn't explicitly specify an encoding, because the HTTP standard states that that is the default.

See the Encoding section of the Advanced documentation:

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

Bold emphasis mine.

Pass in the response.content raw data instead:

soup = BeautifulSoup(r.content)

I see that you are using BeautifulSoup 3. You really want to upgrade to BeautifulSoup 4 instead; version 3 has been discontinued in 2012, and contains several bugs. Install the beautifulsoup4 project, and use from bs4 import BeautifulSoup.

BeautifulSoup 4 usually does a great job of figuring out the right encoding to use when parsing, either from a HTML <meta> tag or statistical analysis of the bytes provided. If the server does provide a characterset, you can still pass this into BeautifulSoup from the response, but do test first if requests used a default:

encoding = r.encoding if 'charset' in r.headers.get('content-type', '').lower() else None
parser = 'html.parser'  # or lxml or html5lib
soup = BeautifulSoup(r.content, parser, from_encoding=encoding)

Last but not least, with BeautifulSoup 4, you can extract all text from a page using soup.get_text():

text = soup.get_text()
print text

You are instead converting a result list (the return value of soup.findAll()) to a string. This never can work because containers in Python use repr() on each element in the list to produce a debugging string, and for strings that means you get escape sequences for anything not a printable ASCII character.

Solution 2

It's not BeautifulSoup's fault. You can see this by printing out encodedText, before you ever use BeautifulSoup: the non-ASCII characters are already gibberish.

The problem here is that you are mixing up bytes and characters. For a good overview of the difference, read one of Joel's articles, but the gist is that bytes are, well, bytes (groups of 8 bits without any further meaning attached), whereas characters are the things that make up strings of text. Encoding turns characters into bytes, and decoding turns bytes back into characters.

A look at the requests documentation shows that r.text is made of characters, not bytes. You shouldn't be encoding it. If you try to do so, you will make a byte string, and when you try to treat that as characters, bad things will happen.

There are two ways to get around this:

Use the raw undecoded bytes, which are stored in r.content, as Martijn suggested. Then you can decode them yourself to turn them into characters.
Let requests do the decoding, but just make sure it uses the right codec. Since you know that's UTF-8 in this case, you can set r.encoding = 'utf-8'. If you do this before you access r.text, then when you do access r.text, it will have been properly decoded, and you get a character string. You don't need to mess with character encodings at all.

Incidentally, Python 3 makes it somewhat easier to maintain the difference between character strings and byte strings, because it requires you to use different types of objects to represent them.

Solution 3

There are a couple of errors in your code:

First of all, your attempt at re-encoding the text is not needed. Requests can give you the native encoding of the page and BeautifulSoup can take this info and do the decoding itself:

# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulSoup

url = "http://www.columbia.edu/~fdc/utf8/"
r = requests.get(url)

soup = BeautifulSoup(r.text, "html5lib")

Second of all, you have an encoding issue. You are probably trying to visualize the results on the terminal. What you will get is the unicode representation of the characters in the text for every character that is not in the ASCII set. You can check the results like this:
```
res = [item.encode("ascii","ignore") for item in soup.find_all(text=True)]
```

22,600

Author by

user1767754

Updated on October 04, 2021

Comments

user1767754 over 2 years

I am trying to load a html-page and output the text, even though i am getting the webpage correctly, BeautifulSoup destroys somehow the encoding.

Source:

# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulSoup

url = "http://www.columbia.edu/~fdc/utf8/"
r = requests.get(url)

encodedText = r.text.encode("utf-8")
soup = BeautifulSoup(encodedText)
text =  str(soup.findAll(text=True))
print text.decode("utf-8")

Excerpt Output:

...Odenw\xc3\xa4lderisch...

this should be Odenwälderisch

gsnedders about 8 years

from_encoding=r.encoding won't actually work, because requests defaults to ISO-8859-1, whereas web content relies on parsing out meta elements.
gsnedders about 8 years

FWIW, RFC 2616 is obsolete, and RFC 7231 makes the default what the media type definition says. So, uh, the HTTP standard doesn't say that any more. (Yay specs reflecting reality!) You might want to have ';charset for some extra resilience, BTW (foo/charset could theoretically exist, after all).
Martijn Pieters about 8 years

@gsnedders: this is specific to HTML processing, so we should be safe there! :-)
lesingerouge about 8 years

@gsnedders didn't know that. thanks for sharing! edited my asnwer accordingly.
Alex Shpilkin over 5 years

@gsnedders Nothing prohibits whitespace before the semicolon and charset, so this might not work either, unfortunately.