How to remove \xa0 from string in Python?
Solution 1
\xa0 is actually non-breaking space in Latin1 (ISO 8859-1), also chr(160). You should replace it with a space.
string = string.replace(u'\xa0', u' ')
Calling .encode('utf-8') encodes the Unicode string to UTF-8, in which every code point is represented by 1 to 4 bytes. In this case, \xa0 is represented by the 2 bytes \xc2\xa0.
Read up on http://docs.python.org/howto/unicode.html.
Please note: this answer is from 2012. Python has moved on, and you should be able to use unicodedata.normalize now.
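A minimal Python 3 sketch of the points above, assuming the text is already a str (the variable names and the sample string are made up for illustration):

```python
# Python 3: every str is Unicode, so the u'' prefix is optional.
text = "Dear Parent,\xa0Thanks"

# U+00A0 is chr(160), the non-breaking space.
assert "\xa0" == chr(160)

# Replace every non-breaking space with an ordinary space.
# str.replace returns a new string, so the result must be assigned.
cleaned = text.replace("\xa0", " ")
print(cleaned)  # Dear Parent, Thanks

# In UTF-8, U+00A0 is the two bytes C2 A0; in Latin-1 it is the single byte A0.
utf8_bytes = "\xa0".encode("utf-8")
latin1_bytes = "\xa0".encode("latin-1")
print(utf8_bytes)    # b'\xc2\xa0'
print(latin1_bytes)  # b'\xa0'
```

This is why encoding without replacing first makes \xc2 show up: it is the first byte of the UTF-8 encoding of the non-breaking space.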
Solution 2
There are many useful things in Python's unicodedata library. One of them is the .normalize() function.
Try:
import unicodedata
new_str = unicodedata.normalize("NFKD", unicode_str)
Replace NFKD with one of the other normalization forms (NFC, NFD, NFKC) if you don't get the results you're after.
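To see how the forms differ on \xa0, here is a small Python 3 sketch. The sample string is invented; the behaviour shown follows from the Unicode compatibility mapping of U+00A0 to an ordinary space.

```python
import unicodedata

# "café", a non-breaking space, then "au lait"
text = "caf\u00e9\xa0au lait"

nfkd = unicodedata.normalize("NFKD", text)
nfkc = unicodedata.normalize("NFKC", text)
nfc = unicodedata.normalize("NFC", text)

# NFKD: \xa0 becomes a space, but é is split into e + combining accent.
print(repr(nfkd))  # 'cafe\u0301 au lait'

# NFKC: \xa0 becomes a space and é stays a single character.
print(repr(nfkc))  # 'café au lait'

# NFC: no compatibility mapping is applied, so \xa0 is left alone.
print(nfc == text)  # True
```

If the text contains accented or non-Latin letters that must still compare equal afterwards, NFKC is usually the safer choice of the two compatibility forms.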
Solution 3
After trying several methods, this is how I did it. Here are two ways of removing \xa0 characters from a parsed HTML string.
Assume we have the following raw HTML:
raw_html = '<p>Dear Parent,&nbsp;</p><p><span style="font-size: 1rem;">This is a test message,&nbsp;</span><span style="font-size: 1rem;">kindly ignore it.&nbsp;</span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
So let's try to clean this HTML string:
from bs4 import BeautifulSoup
raw_html = '<p>Dear Parent,&nbsp;</p><p><span style="font-size: 1rem;">This is a test message,&nbsp;</span><span style="font-size: 1rem;">kindly ignore it.&nbsp;</span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
text_string = BeautifulSoup(raw_html, "lxml").text
print(repr(text_string))
# 'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks' (with a u prefix on Python 2)
The code above leaves \xa0 characters in the string. There are two ways to remove them properly.
Method # 1 (Recommended): The first is BeautifulSoup's get_text method with the strip argument set to True. Our code becomes:
clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)
print(clean_text)
# Dear Parent,This is a test message,kindly ignore it.Thanks
Method # 2: The other option is to use Python's unicodedata library, which turns each \xa0 into an ordinary space:
import unicodedata
text_string = BeautifulSoup(raw_html, "lxml").text
clean_text = unicodedata.normalize("NFKD", text_string)
print(repr(clean_text))
# 'Dear Parent, This is a test message, kindly ignore it. Thanks' (with a u prefix on Python 2)
I have also detailed these methods on this blog, which you may want to refer to.
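For readers without BeautifulSoup at hand, the origin of \xa0 and two ways of cleaning it can be reproduced with the standard library alone. This is only a sketch and not part of the answer above: html.unescape stands in for the parser's entity decoding, the whitespace-collapsing step is a swapped-in technique, and the sample string is invented.

```python
import html
import unicodedata

# html.unescape decodes &nbsp; to U+00A0, which is where the \xa0 comes from.
decoded = html.unescape("Dear Parent,&nbsp;Thanks&nbsp;")
print(repr(decoded))  # 'Dear Parent,\xa0Thanks\xa0'

# NFKD turns each U+00A0 into an ordinary space (the trailing one included).
normalized = unicodedata.normalize("NFKD", decoded)
print(repr(normalized))  # 'Dear Parent, Thanks '

# str.split() with no arguments treats U+00A0 as whitespace, so splitting and
# re-joining collapses whitespace runs and drops leading/trailing whitespace.
collapsed = " ".join(decoded.split())
print(repr(collapsed))  # 'Dear Parent, Thanks'
```

Unlike normalization, the split-and-join variant also removes line breaks and tabs, which may or may not be what you want.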
Solution 4
Try using .strip() at the end of your line:
line.strip()
It worked well for me.
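A quick Python 3 check of what .strip() does with \xa0 (the sample string is invented). Note that .strip() returns a new string, so the result has to be assigned:

```python
line = "\xa0Dear Parent,\xa0Thanks\xa0\n"

# strip() removes whitespace, including U+00A0, from both ends only.
stripped = line.strip()
print(repr(stripped))  # 'Dear Parent,\xa0Thanks'

# The non-breaking space in the middle is still there.
print("\xa0" in stripped)  # True
```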
Solution 5
try this:
string.replace('\\xa0', ' ')
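The doubled backslash matters here. A short Python 3 sketch (variable names and sample strings are made up) showing that '\\xa0' is a four-character escape sequence as text, not the non-breaking space itself:

```python
escaped = "\\xa0"  # four characters: backslash, x, a, 0
actual = "\xa0"    # one character: U+00A0
print(len(escaped), len(actual))  # 4 1

# On text that contains a real non-breaking space, the escaped form matches nothing.
text = "a\xa0b"
unchanged = text.replace("\\xa0", " ")
print(unchanged == text)  # True

# It only helps when the input literally contains the characters \xa0,
# for example text copied from a repr() or a log file.
logged = r"a\xa0b"
fixed = logged.replace("\\xa0", " ")
print(fixed)  # a b
```

If the input really does contain literal \xa0 escapes, it is usually better to fix whatever upstream step produces them.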
zhuyxn
Updated on January 31, 2021
Comments
-
zhuyxn about 3 years
I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but it seems like I'm being left with a lot of \xa0 Unicode representing spaces. Is there an efficient way to remove all of them in Python 2.7, and change them into spaces? I guess the more generalized question would be, is there a way to remove Unicode formatting?
I tried using line = line.replace(u'\xa0',' '), as suggested by another thread, but that changed the \xa0's to u's, so now I have "u"s everywhere instead. ):
EDIT: The problem seems to be resolved by str.replace(u'\xa0', ' ').encode('utf-8'), but just doing .encode('utf-8') without replace() seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?
-
zhuyxn almost 12 years
Tried that already: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
-
jpaugh almost 12 years
Embrace Unicode. Use u''s instead of ''s. :-)
-
zhuyxn almost 12 years
Tried using str.replace(u'\xa0', ' ') but got "u"s everywhere instead of \xa0s :/
-
pepr almost 12 years
If the string is a unicode one, you have to use the u' ' replacement, not ' '. Is the original string a unicode one?
-
Martijn Pieters over 11 years
You are now removing anything that isn't an ASCII character; you are probably masking your actual problem. Using 'ignore' is like shoving through the shift stick even though you don't understand how the clutch works.
-
dbr over 10 years
@MartijnPieters The linked unicode tutorial is good, but you are completely correct - str.encode(..., 'ignore') is the Unicode-handling equivalent of try: ... except: .... While it might hide the error message, it rarely solves the problem.
-
dbr over 10 years
I don't know a huge amount about Unicode and character encodings, but it seems like unicodedata.normalize would be more appropriate than str.replace.
-
Admin over 9 years
Yours is workable advice for strings, but note that all references to this string will also need to be replaced. For example, if you have a program that opens files, and one of the files has a non-breaking space in its name, you will need to rename that file in addition to doing this replacement.
-
andilabs over 9 years
For some purposes, like dealing with email or URLs, it seems perfect to use .decode('ascii', 'ignore')
-
jfs about 9 years
U+00a0 is a non-breakable space Unicode character that can be encoded as the b'\xa0' byte in latin1 encoding, or as the two bytes b'\xc2\xa0' in utf-8 encoding. It can be represented as &nbsp; in html.
-
jfs about 9 years
@RyanMartin: this replaces four bytes: len(b'\\xa0') == 4 but len(b'\xa0') == 1. If possible, you should fix the upstream code that generates these escapes.
-
jfs about 9 years
samwize's answer didn't work for you because it works on Unicode strings. line.decode() in your answer suggests that your input is a bytestring (you should not call .decode() on a Unicode string; to enforce it, the method is removed in Python 3). I don't understand how it is possible to see the tutorial that you've linked in your answer and miss the difference between bytes and Unicode (do not mix them).
-
jfs about 9 years
This works if text is a bytestring that represents text encoded using utf-8. If you are working with text, decode it to Unicode first (.decode('utf-8')) and encode it to a bytestring only at the very end (if the API does not support Unicode directly, e.g., socket). All intermediate operations on the text should be performed on Unicode.
-
jfs about 9 years
strip=True works only if &nbsp; is at the beginning or end of each bit of text. It won't remove the space if it is in between other characters in the text.
-
jfs about 9 years
0xc2a0 is ambiguous (byte order). Use the b'\xc2\xa0' bytes literal instead.
-
jds almost 9 years
When I try this, I get UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 397: ordinal not in range(128).
-
Mushroom Man over 7 years
I tried this code on a list of strings; it didn't do anything, and the \xa0 character remained. If I re-encoded my text file to UTF-8, the character would appear as an upper-case A with a caret on its head, and when I encoded it in Unicode the Python interpreter crashed.
-
José Tomás Tocino almost 7 years
This did the trick. Had some HTML generated by... Microsoft Word with lots of weird unicode characters and this somehow cleaned them all.
-
Faccion over 6 years
Not so sure: you may want normalize('NFKD', '1º\xa0dia') to return '1º dia', but it returns '1o dia'.
-
TT-- over 6 years
Here are the docs about unicodedata.normalize
-
Cho over 4 years
Ah, if the text is 'KOREAN', do not try this. All the characters end up garbled.
-
arizqi almost 4 years
This solution changes the Russian letter й to an identical-looking sequence of two unicode characters. The problem here is that strings that used to be equal do not match anymore. Fix: use "NFKC" instead of "NFKD".
-
Jenya Pu almost 4 years
This solution worked for me: string.replace('\xa0', ' ')
-
the_economist almost 4 years
It doesn't catch the 'soft hyphen' (-), which is '\xad' in Latin1. Is there any trick to also catch this symbol?
-
the_economist almost 4 years
@Markus: The same applies to the German Umlaute ö, ü and ä. 'NFKC' is required instead of 'NFKD'.
-
Bill about 3 years
This will only remove it if it's at the beginning or end of the string.
-
Amir Shabani almost 3 years
This is awesome. It changes the one-letter string ﷼ to the four-letter string ریال that it actually is. So it's much easier to replace when needed. You'd normalize and then replace, without having to care which one it was: normalize("NFKD", "﷼").replace("ریال", '').
-
ChewChew over 2 years
get_text(strip=True) really did the trick. Thanks m8
-
Jean Monet about 2 years
@dbr unicodedata does not replace \xa0 with NFC (which properly retains letters with accents such as é). Example: unicodedata.normalize("NFC", "LEFT\xa0RIGHT") == "LEFT\xa0RIGHT".
-
Y4RD13 almost 2 years
This is very specific to raw HTML returning unicode after cleaning with bs4 or regex. Works perfectly, but it will not remove line breaks or tabs.