UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026'

53,106

Solution 1

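In Python 2, print implicitly encodes a unicode object using the output stream's encoding, which falls back to ASCII when no encoding can be detected (for example, when output is piped or redirected). Any character outside ASCII, such as u'\u2026' (the horizontal ellipsis), then raises this error. You probably want to encode the text explicitly and print the resulting str: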

    print post.text.encode('utf-8')
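A minimal sketch of why the explicit encode helps, assuming a made-up sample string in place of post.text and a hypothetical helper named try_encode. It only shows that the ASCII codec rejects u'\u2026' while UTF-8 accepts it; it is not taken from the original answer.

```python
# -*- coding: utf-8 -*-
text = u"Read more\u2026"  # hypothetical stand-in for post.text


def try_encode(value, encoding):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return value.encode(encoding)
    except UnicodeEncodeError:
        return None


ascii_bytes = try_encode(text, "ascii")  # fails: U+2026 is outside ASCII
utf8_bytes = try_encode(text, "utf-8")   # succeeds: UTF-8 covers all of Unicode

print(repr(ascii_bytes))
print(repr(utf8_bytes))
```

The UTF-8 form of the ellipsis starts with the byte 0xe2, so a later "can't decode byte 0xe2" error usually means these UTF-8 bytes were implicitly decoded as ASCII somewhere else in the code.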

Solution 2

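    import urllib.request  # this answer targets Python 3
    from bs4 import BeautifulSoup
    html = urllib.request.urlopen(THE_URL).read()
    soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
    # encode() returns bytes, so decode them instead of wrapping them in str()
    print("'" + soup.encode("ascii").decode("ascii") + "'")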

worked for me ;-)
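Solution 2 presumably avoids the exception because the characters ASCII cannot represent are substituted rather than rejected. As far as I know, Beautiful Soup's encode() replaces them with XML character references by default, but treat that as an assumption. The sketch below uses only the standard str.encode error handlers on a made-up sample string to show what each substitution strategy produces.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

text = u"Read more\u2026"  # hypothetical stand-in for the page text

# Each error handler decides what happens to characters ASCII cannot represent.
as_reference = text.encode("ascii", "xmlcharrefreplace")  # numeric character reference
as_question_mark = text.encode("ascii", "replace")        # substitute "?"
dropped = text.encode("ascii", "ignore")                  # silently drop the character

for label, value in [("xmlcharrefreplace", as_reference),
                     ("replace", as_question_mark),
                     ("ignore", dropped)]:
    print(label, repr(value))
```

All three lose or alter the original character, so this suits quick debugging output better than data you intend to keep; encoding to UTF-8 preserves the text.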

Author: user1063287


Updated on July 09, 2022

Comments

  • user1063287
    user1063287 almost 2 years

    I'm learning about urllib2 and Beautiful Soup and on first tests am getting errors like:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 10: ordinal not in range(128)
    

    There seem to be lots of posts about this type of error, and I have tried the solutions that I can understand, but each seems to lead to a catch-22, e.g.:

    I want to print post.text (where text is a Beautiful Soup attribute that just returns the text). Both str(post.text) and post.text produce the Unicode errors (on characters like the right apostrophe ’ and the ellipsis …).

    So I add post = unicode(post) above str(post.text), then I get:

    AttributeError: 'unicode' object has no attribute 'text'
    

    I also tried (post.text).encode() and (post.text).renderContents(). The latter produced the error:

    AttributeError: 'unicode' object has no attribute 'renderContents'
    

    and then I tried str(post.text).renderContents() and got the error:

    AttributeError: 'str' object has no attribute 'renderContents'
    

    It would be great if I could just declare somewhere at the top of the document "make this content interpretable" and still have access to the required text attribute.


    Update: after suggestions:

    If I add post = post.decode("utf-8") above str(post.text) I get:

    TypeError: unsupported operand type(s) for -: 'str' and 'int'  
    

    If I add post = post.decode() above str(post.text) I get:

    AttributeError: 'unicode' object has no attribute 'text'
    

    If I add post = post.encode("utf-8") above (post.text) I get:

    AttributeError: 'str' object has no attribute 'text'
    

    I tried print post.text.encode('utf-8') and got:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39: ordinal not in range(128)
    

    And for the sake of trying things that might work, I installed lxml for Windows from here and implemented it with:

    parsed_content = BeautifulSoup(original_content, "lxml")
    

    according to http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters.

    These steps didn't seem to make a difference.

    I'm using Python 2.7.4 and Beautiful Soup 4.


    Solution:

    After getting a deeper understanding of Unicode, UTF-8 and Beautiful Soup types, I found the problem was in how I was printing. I removed all my str() calls and concatenations, e.g. str(something) + post.text + str(something_else), and passed the parts to print as separate comma-separated items instead: something, post.text, something_else. It now seems to print well, except that I have less control over the formatting at this stage (e.g. print inserts a space at each comma).
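    One possible way to regain control over the formatting, sketched under assumptions: build the whole line as text with format() and encode it once at the end. The function name format_post, the sample values and the layout are all hypothetical and not part of the original code.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def format_post(index, text, source):
    """Join the parts as text; format() needs no str() calls and adds no extra spaces."""
    return "{0}. '{1}' [{2}]".format(index, text, source)


line = format_post(1, "Read more\u2026", "example")

# Encode exactly once, at the output boundary, instead of mixing bytes and text.
encoded = line.encode("utf-8")
print(repr(encoded))
```

    The point of the design is that everything stays text until the final step, so no implicit ASCII conversion is triggered along the way.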