UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026'
Solution 1
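In Python 2, unicode objects can only be printed if they can be encoded to the output encoding, which falls back to ASCII when Python cannot detect one (str() also uses ASCII by default). If the text contains a character that cannot be encoded, you'll get that error. You probably want to encode it explicitly and then print the resulting str: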
print post.text.encode('utf-8')
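To make the difference concrete, here is a minimal sketch that runs on both Python 2 and Python 3. The variable text is a hypothetical stand-in for post.text, not something from the question; it just holds a string containing the same U+2026 character named in the error.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for post.text: a text string containing U+2026 (ellipsis).
text = u"Read more\u2026"

# Encoding to ASCII fails because U+2026 is outside the ASCII range (0-127).
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 always succeeds; the ellipsis becomes three bytes.
utf8_bytes = text.encode("utf-8")
```

The three bytes produced for the ellipsis start with 0xe2, which is the same byte that shows up in the UnicodeDecodeError quoted in the question.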
Solution 2
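import urllib.request
from bs4 import BeautifulSoup
html = urllib.request.urlopen(THE_URL).read()  # THE_URL is a placeholder for the page to fetch
soup = BeautifulSoup(html)
# Decode the ASCII bytes back to text so Python 3 does not print a b'...' repr
print("'" + soup.encode("ascii").decode("ascii") + "'")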
worked for me ;-)
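The reason encoding to ASCII can work at all is that an error handler decides what happens to characters ASCII cannot represent. As far as I can tell, Beautiful Soup's encode() substitutes character references for them; the sketch below shows the equivalent built-in handlers on a plain string, so it is an illustration of the idea rather than of Beautiful Soup itself. The sample text is made up.

```python
# Hypothetical sample text containing U+2026 (ellipsis).
text = u"Read more\u2026"

# Replace unencodable characters with numeric character references.
as_entities = text.encode("ascii", "xmlcharrefreplace")

# Replace unencodable characters with a question mark.
as_question = text.encode("ascii", "replace")

# Silently drop unencodable characters.
dropped = text.encode("ascii", "ignore")
```

Only the first handler keeps enough information to recover the original character later; the other two are lossy.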
user1063287
Updated on July 09, 2022
user1063287 almost 2 years
I'm learning about urllib2 and Beautiful Soup and on first tests am getting errors like:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 10: ordinal not in range(128)
There seem to be lots of posts about this type of error, and I have tried the solutions that I can understand, but there seem to be catch-22s with them, e.g.:
I want to print post.text (where text is a Beautiful Soup method that just returns the text). Both str(post.text) and post.text produce the unicode errors (on things like the right apostrophe ’ and the ellipsis …).
So I add post = unicode(post) above str(post.text), and then I get:
AttributeError: 'unicode' object has no attribute 'text'
I also tried (post.text).encode() and (post.text).renderContents(). The latter produced the error:
AttributeError: 'unicode' object has no attribute 'renderContents'
Then I tried str(post.text).renderContents() and got the error:
AttributeError: 'str' object has no attribute 'renderContents'
It would be great if I could just define somewhere at the top of the document 'make this content interpretable' and still have access to the required text function.
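One way to approximate the "define it once at the top" idea is to leave post.text alone and wrap the output stream in an encoding writer, so text is encoded only where it leaves the program. The sketch below assumes UTF-8 output and uses an in-memory io.BytesIO as a stand-in for the real destination; in Python 2 the same wrapper is often applied to sys.stdout, but whether that suits a given console is an assumption to check.

```python
import codecs
import io

# Stand-in for a byte-oriented output such as a file or a pipe (assumption).
raw_stream = io.BytesIO()

# Wrap the stream once; everything written through `writer` is encoded as UTF-8.
writer = codecs.getwriter("utf-8")(raw_stream)

# Text with a right apostrophe (U+2019) and an ellipsis (U+2026) passes through unchanged.
writer.write(u"It\u2019s here\u2026\n")

written = raw_stream.getvalue()
```

With this arrangement the scraping code keeps working with text objects and their attributes, and no str() call is needed anywhere before output.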
Update: after suggestions:
If I add post = post.decode("utf-8") above str(post.text), I get:
TypeError: unsupported operand type(s) for -: 'str' and 'int'
If I add post = post.decode() above str(post.text), I get:
AttributeError: 'unicode' object has no attribute 'text'
If I add post = post.encode("utf-8") above (post.text), I get:
AttributeError: 'str' object has no attribute 'text'
I tried print post.text.encode('utf-8') and got:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39: ordinal not in range(128)
And for the sake of trying things that might work, I installed lxml for Windows and used it with:
parsed_content = BeautifulSoup(original_content, "lxml")
according to http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters.
These steps didn't seem to make a difference.
I'm using Python 2.7.4 and Beautiful Soup 4.
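The UnicodeDecodeError in the update looks like the mirror image of the original problem: UTF-8 bytes being decoded as ASCII. In Python 2 that decode happens implicitly whenever a byte str is combined with a unicode object, which is one plausible cause here, though the question does not show the surrounding code. The sketch below performs the decode explicitly so the same failure is visible on Python 2 and Python 3.

```python
# UTF-8 bytes for U+2026 (ellipsis); the first byte is 0xe2.
utf8_bytes = u"\u2026".encode("utf-8")

# Decoding those bytes as ASCII fails on the first byte, which is above 127.
try:
    utf8_bytes.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the encoding that produced the bytes round-trips cleanly.
round_tripped = utf8_bytes.decode("utf-8")
```

The practical consequence is to avoid mixing encoded bytes and text in one expression: either keep everything as text until output, or encode every piece first.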
Solution:
After getting a deeper understanding of unicode, UTF-8 and Beautiful Soup types, it turned out to have something to do with my printing methodology. I removed all my str calls and concatenations, e.g. str(something) + post.text + str(something_else), so that it was something, post.text, something_else, and it seems to be printing well, except that I have less control over the formatting at this stage (e.g. spaces inserted at each ,).
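To get control of the spacing back without reintroducing str(), one option is to build the whole line as text with a unicode format string and encode it once at the end. This is a sketch under that assumption; format_line and the sample values are hypothetical names for illustration, not part of the original code.

```python
def format_line(prefix, text, suffix):
    """Join the pieces as text, with no separators added between them."""
    return u"{0}{1}{2}".format(prefix, text, suffix)

# Hypothetical values standing in for something, post.text and something_else.
line = format_line(u"'", u"It\u2019s here\u2026", u"'")

# Encode once, at the point where the line leaves the program.
encoded_line = line.encode("utf-8")
```

Because the pieces are combined inside a unicode format string, nothing is implicitly converted through ASCII, and the separators are whatever the format string says rather than the spaces that print inserts between comma-separated items.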