How to work with Unicode in Python
Solution 1
Maybe you should be doing:
s=unicodestring.replace(u'\xa0',u'')
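For readers on Python 3, where every str is already Unicode and the u prefix is optional, the same call can be sketched as follows. The sample text is an illustrative assumption, not taken from the question.

```python
# Assumed sample input: U+00A0 (no-break space) between two words.
unicodestring = "Hello,\xa0World"

# Remove the no-break space entirely, as in the line above.
removed = unicodestring.replace("\xa0", "")

# Or swap it for an ordinary space, which often reads better in plain text.
spaced = unicodestring.replace("\xa0", " ")

print(removed)  # Hello,World
print(spaced)   # Hello, World
```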
Solution 2
s=unicodestring.replace('\xa0','')
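...passes a byte string containing the single byte 0xa0. Because unicodestring is a Unicode string, Python 2 implicitly decodes that argument as ASCII (byte strings are the default string type in Python until version 3.x), and 0xa0 is outside the ASCII range, so the call raises UnicodeDecodeError.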
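The error in the question can be reproduced by decoding the byte 0xa0 as ASCII by hand, which is roughly what Python 2 attempts implicitly when a byte string meets a Unicode string. This is a sketch in Python 3 syntax (an assumption, since the question used Python 2), where the byte string needs a b prefix.

```python
raw = b"\xa0"  # one byte with value 0xa0

try:
    raw.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# Shows the same "'ascii' codec can't decode byte 0xa0 ..." text as the question.
print(message)

# Decoding with a codec that defines 0xa0 succeeds. Latin-1 is an assumed
# source encoding here; the real one depends on where the data came from.
nbsp = raw.decode("latin-1")
print(repr(nbsp))  # '\xa0'
```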
The reason r'\xa0' did not raise an error is that escape sequences have no effect in a raw string. Rather than turning \xa0 into a single character, Python read the string as four characters: a literal backslash, a literal x, and so on. That four-character sequence does not occur in the text, so nothing was replaced.
The following are the same:
>>> r'\xa0'
'\\xa0'
>>> '\\xa0'
'\\xa0'
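A quick way to confirm the difference between the raw and the escaped form is to compare lengths. A minimal sketch in Python 3 syntax, with variable names and sample text chosen for illustration:

```python
raw_form = r"\xa0"     # four characters: backslash, x, a, 0
escaped_form = "\xa0"  # the single character U+00A0

print(len(raw_form))      # 4
print(len(escaped_form))  # 1

text = "Hello,\xa0World"
# The raw form never matches, so replace returns the text unchanged.
unchanged = text.replace(raw_form, "")
print(unchanged == text)  # True

# The escaped form matches the no-break space and removes it.
cleaned = text.replace(escaped_form, "")
print(cleaned)  # Hello,World
```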
The implicit ASCII decoding problem is resolved in Python 3, where the default string type is Unicode, so you can just do:
>>> '\xa0'
'\xa0'
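To see which character this is, and how Python 3 reacts when a byte string is passed to the same method, here is a small sketch using the standard unicodedata module. The sample text is an assumption.

```python
import unicodedata

ch = "\xa0"
code_point = ord(ch)
char_name = unicodedata.name(ch)
print(code_point)  # 160
print(char_name)   # NO-BREAK SPACE

text = "Hello,\xa0World"
try:
    # Python 3 rejects bytes arguments to str.replace outright
    # instead of decoding them implicitly.
    text.replace(b"\xa0", b"")
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(mixed_ok)  # False
```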
I am trying to clean all of the HTML out of a string so the final output is a text file
I would strongly recommend BeautifulSoup for this. Writing an HTML cleaning tool is difficult (given how horrible most HTML is), and BeautifulSoup does a great job of both parsing HTML and dealing with Unicode:
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<html><body><h1>Hi</h1></body></html>")
>>> print soup.prettify()
<html>
 <body>
  <h1>
   Hi
  </h1>
 </body>
</html>
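If the goal is only to drop the markup and keep the text, a similar result can be sketched with the standard library's html.parser module. This is a swapped-in alternative rather than BeautifulSoup's API; TextExtractor and html_to_text are hypothetical names, and Python 3 is assumed.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and ignore tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join text nodes with spaces, turn no-break spaces into plain spaces,
    # then collapse runs of whitespace.
    text = " ".join(parser.parts).replace("\xa0", " ")
    return " ".join(text.split())


print(html_to_text("<html><body><h1>Hi</h1><p>a&nbsp;b</p></body></html>"))
# Hi a b
```

Joining the pieces with spaces is a simplification: it keeps block elements apart but also splits words that are interrupted by inline tags, so a real cleaner would need to treat block and inline tags differently.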
Solution 3
Look at the codecs module in the standard library, specifically the encode and decode methods provided by the Codec base class.
There's also a good article here that puts it all together.
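As a rough illustration of the codecs module, the sketch below decodes and re-encodes a short UTF-8 byte string. It assumes Python 3, and the sample bytes are made up for the example.

```python
import codecs

data = b"caf\xc3\xa9"  # UTF-8 encoding of "caf" followed by U+00E9

text = codecs.decode(data, "utf-8")
print(ascii(text))  # 'caf\xe9'

round_trip = codecs.encode(text, "utf-8")
print(round_trip == data)  # True

# codecs.lookup returns a CodecInfo object; its stateless decode function
# returns a (result, number_of_bytes_consumed) pair.
info = codecs.lookup("utf-8")
decoded, consumed = info.decode(data)
print(decoded == text, consumed)  # True 5
```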
Solution 4
Instead of this, it's better to use standard Python features.
For example:
string = unicode('Hello, \xa0World', 'utf-8', 'replace')
or
string = unicode('Hello, \xa0World', 'utf-8', 'ignore')
where replace substitutes U+FFFD (the Unicode replacement character) for the undecodable byte \xa0. If \xa0 is not meaningful for you and you want to remove it, use ignore instead.
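The effect of the two error handlers can be checked directly. This is a sketch in Python 3 syntax (an assumption, since the lines above use the Python 2 unicode built-in), where the equivalent call is bytes.decode:

```python
raw = b"Hello, \xa0World"  # 0xa0 on its own is not valid UTF-8

replaced = raw.decode("utf-8", "replace")
ignored = raw.decode("utf-8", "ignore")

print(ascii(replaced))  # 'Hello, \ufffdWorld'
print(ignored)          # Hello, World

# If the bytes are really Latin-1 (an assumption about the source data),
# decoding with that codec keeps the no-break space instead of losing it.
latin = raw.decode("latin-1")
print(ascii(latin))     # 'Hello, \xa0World'
```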
Solution 5
Just a note regarding HTML cleaning. It is very very hard, since
<
body
>
is a valid way to write HTML. Just an FYI.
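To illustrate why pattern-based cleaning is fragile, the sketch below runs two regular expressions over a tag that has a line break before its closing bracket. The patterns and sample markup are assumptions chosen for the example, not a recommended cleaner.

```python
import re

markup = "<body\n>Hello</body>"

# Naive pattern: a tag name directly followed by ">".
naive = re.sub(r"</?\w+>", "", markup)
print(repr(naive))  # '<body\n>Hello'

# Looser pattern: anything between "<" and the next ">".
loose = re.sub(r"<[^>]*>", "", markup)
print(repr(loose))  # 'Hello'

# The looser pattern still breaks when ">" appears inside an attribute value.
tricky = '<a title="x>y">link</a>'
broken = re.sub(r"<[^>]*>", "", tricky)
print(repr(broken))  # 'y">link'
```

Each fix to the pattern exposes another edge case, which is the usual argument for using a real parser.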
PyNEwbie
Updated on June 14, 2022
Comments
-
PyNEwbie almost 2 years
I am trying to clean all of the HTML out of a string so the final output is a text file. I have done some research on the various 'converters' and am starting to lean towards creating my own dictionary for the entities and symbols and running a replace on the string. I am considering this because I want to automate the process and there is a lot of variability in the quality of the underlying HTML. To begin comparing the speed of my solution with one of the alternatives, for example pyparsing, I decided to test replacing \xa0 using the string method replace. I get a
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
The actual line of code was
s=unicodestring.replace('\xa0','')
Anyway, I decided that I needed to prefix it with an r, so I ran this line of code:
s=unicodestring.replace(r'\xa0','')
It runs without error, but when I look at a slice of s I see that the \xa0 is still there.
-
David Z about 15 years
Why would you prefix '\xa0' with an r? That makes it a raw string - that is, it literally contains backslash, x, a, 0. Without the r, it contained a single character with hex code a0, which I think is what you wanted.
-
PyNEwbie about 15 years
Because I was trying to guess why I got the error, and I know that sometimes to force the \ to be read you have to make it a string literal. Also, the \xa0 is what actually exists in my source. What is hex code a0?
-
PyNEwbie about 15 years
So how did you know to do this, since I have not seen this in any example? Thanks
-
PyNEwbie about 15 years
Thanks, great article. You are right, it does put a lot together.
-
PyNEwbie about 15 years
I appreciate this answer. I have used BS to extract data from tables and it is very useful. However, it seems to me that to remove the HTML using BS I have to know what is present. Am I wrong about that?
-
dbr about 15 years
I'm not sure what you mean. You can remove HTML in countless ways, from the first table in a div to by class or id, etc.
-
Gourneau almost 12 years
BeautifulSoup.prettify() was just a life saver! Thanks!