How to remove \xa0 from string in Python?


Solution 1

\xa0 is a non-breaking space in Latin-1 (ISO 8859-1), also chr(160). You should replace it with a space:

string = string.replace(u'\xa0', u' ')

When you call .encode('utf-8'), the Unicode string is encoded to UTF-8, which represents every code point with 1 to 4 bytes. In this case, \xa0 is represented by the two bytes \xc2\xa0.
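Both points can be checked directly in Python 3 (a small sketch; the string is illustrative):

```python
# U+00A0 is a single code point; in UTF-8 it becomes the two bytes \xc2\xa0.
s = "Dear Parent,\xa0Thanks"

cleaned = s.replace("\xa0", " ")   # replace the NBSP with a regular space
encoded = "\xa0".encode("utf-8")   # the same code point as UTF-8 bytes

print(cleaned)   # Dear Parent, Thanks
print(encoded)   # b'\xc2\xa0'
```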

Read up on http://docs.python.org/howto/unicode.html.

Please note: this answer is from 2012. Python has moved on, and you should be able to use unicodedata.normalize now.

Solution 2

There are many useful functions in Python's unicodedata library. One of them is the .normalize() function.

Try:

new_str = unicodedata.normalize("NFKD", unicode_str)

Replace NFKD with any of the other normalization forms (NFC, NFD, NFKC) if you don't get the results you're after.
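For example (a small Python 3 sketch): NFKD maps U+00A0 to a regular space, but compatibility decomposition also rewrites other characters, which may be broader than you want (see the comments below about accented letters and Korean text):

```python
import unicodedata

text = "Dear Parent,\xa0Thanks"
print(unicodedata.normalize("NFKD", text))   # Dear Parent, Thanks

# NFKD affects more than just \xa0 -- superscripts are folded too, for example:
print(unicodedata.normalize("NFKD", "x²"))   # x2
```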

Solution 3

After trying several methods, this is how I did it, to summarize. Below are two ways of avoiding/removing \xa0 characters from a parsed HTML string.

Assume we have the following raw HTML:

raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'

So let's try to clean this HTML string:

from bs4 import BeautifulSoup
raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
text_string = BeautifulSoup(raw_html, "lxml").text
print(repr(text_string))
# 'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks'

The above code leaves \xa0 characters in the string. There are two ways to remove them properly.

Method #1 (recommended): the first option is BeautifulSoup's get_text method with the strip argument set to True. Our code becomes:

clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)
print(clean_text)
# Dear Parent,This is a test message,kindly ignore it.Thanks

Method #2: the other option is to use Python's unicodedata library:

import unicodedata
from bs4 import BeautifulSoup

text_string = BeautifulSoup(raw_html, "lxml").text
clean_text = unicodedata.normalize("NFKD", text_string)
print(clean_text)
# Dear Parent, This is a test message, kindly ignore it. Thanks

I have also detailed these methods in a blog post which you may want to refer to.

Solution 4

Try using .strip() at the end of your line: line.strip() worked well for me.
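In Python 3, str.strip() with no arguments removes leading and trailing whitespace, and '\xa0' counts as whitespace; note that it leaves \xa0 between other characters untouched (a minimal sketch):

```python
line = "\xa0hello world\xa0"
print(repr(line.strip()))       # 'hello world'

# Interior non-breaking spaces are left alone:
print(repr("a\xa0b".strip()))   # 'a\xa0b'
```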

Solution 5

Try this:

string.replace('\\xa0', ' ')
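Note the double backslash: '\\xa0' is the four-character literal sequence backslash-x-a-0, not the single non-breaking-space character '\xa0'. This version only helps if your text actually contains the literal escape (e.g. leaked from repr() output); a sketch of the difference:

```python
# '\\xa0' is 4 characters; '\xa0' is the single U+00A0 code point.
assert len("\\xa0") == 4 and len("\xa0") == 1

# Replaces a literal backslash escape, if that's what the text contains:
print("Dear\\xa0Parent".replace("\\xa0", " "))   # Dear Parent

# ...but leaves a real U+00A0 untouched:
print("Dear\xa0Parent".replace("\\xa0", " ") == "Dear\xa0Parent")  # True
```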
Author: zhuyxn

Updated on January 31, 2021

Comments

  • zhuyxn
    zhuyxn about 3 years

    I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but it seems like I'm being left with a lot of \xa0 Unicode representing spaces. Is there an efficient way to remove all of them in Python 2.7, and change them into spaces? I guess the more generalized question would be, is there a way to remove Unicode formatting?

    I tried using: line = line.replace(u'\xa0',' '), as suggested by another thread, but that changed the \xa0's to u's, so now I have "u"s everywhere instead. ):

    EDIT: The problem seems to be resolved by str.replace(u'\xa0', ' ').encode('utf-8'), but just doing .encode('utf-8') without replace() seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?

    • zhuyxn
      zhuyxn almost 12 years
      tried that already, 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
    • jpaugh
      jpaugh almost 12 years
      embrace Unicode. Use u''s instead of ''s. :-)
    • zhuyxn
      zhuyxn almost 12 years
      tried using str.replace(u'\xa0', ' ') but got "u"s everywhere instead of \xa0s :/
    • pepr
      pepr almost 12 years
      If the string is the unicode one, you have to use the u' ' replacement, not the ' '. Is the original string the unicode one?
  • Martijn Pieters
    Martijn Pieters over 11 years
You are now removing anything that isn't an ASCII character; you are probably masking your actual problem. Using 'ignore' is like forcing the shift stick even though you don't understand how the clutch works.
  • dbr
    dbr over 10 years
    @MartijnPieters The linked unicode tutorial is good, but you are completely correct - str.encode(..., 'ignore') is the Unicode-handling equivalent of try: ... except: .... While it might hide the error message, it rarely solves the problem.
  • dbr
    dbr over 10 years
    I don't know a huge amount about Unicode and character encodings.. but it seems like unicodedata.normalize would be more appropriate than str.replace
  • Admin
    Admin over 9 years
    Yours is workable advice for strings, but note that all references to this string will also need to be replaced. For example, if you have a program that opens files, and one of the files has a non-breaking space in its name, you will need to rename that file in addition to doing this replacement.
  • andilabs
    andilabs over 9 years
    for some purposes like dealing with EMAIL or URLS it seems perfect to use .decode('ascii', 'ignore')
  • jfs
    jfs about 9 years
    U+00a0 is a non-breakable space Unicode character that can be encoded as b'\xa0' byte in latin1 encoding, as two bytes b'\xc2\xa0' in utf-8 encoding. It can be represented as &nbsp; in html.
  • jfs
    jfs about 9 years
    @RyanMartin: this replaces four bytes: len(b'\\xa0') == 4 but len(b'\xa0') == 1. If possible; you should fix upstream that generates these escapes.
  • jfs
    jfs about 9 years
    samwize's answer didn't work for you because it works on Unicode strings. line.decode() in your answer suggests that your input is a bytestring (you should not call .decode() on a Unicode string (to enforce it, the method is removed in Python 3). I don't understand how it is possible to see the tutorial that you've linked in your answer and miss the difference between bytes and Unicode (do not mix them).
  • jfs
    jfs about 9 years
    this works if text is a bytestring that represents a text encoded using utf-8. If you are working with text; decode it to Unicode first (.decode('utf-8')) and encode it to a bytestring only at the very end (if API does not support Unicode directly e.g., socket). All intermediate operations on the text should be performed on Unicode.
  • jfs
    jfs about 9 years
    strip=True works only if &nbsp; is at the beginning or end of each bit of text. It won't remove the space if it is inbetween other characters in the text.
  • jfs
    jfs about 9 years
    0xc2a0 is ambiguous (byte order). Use b'\xc2\xa0' bytes literal instead.
  • jds
    jds almost 9 years
    When I try this, I get UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 397: ordinal not in range(128).
  • Mushroom Man
    Mushroom Man over 7 years
I tried this code on a list of strings and it didn't do anything; the \xa0 character remained. If I re-encoded my text file to UTF-8, the character would appear as an uppercase A with a caret on its head, and when I encoded it as Unicode the Python interpreter crashed.
  • José Tomás Tocino
    José Tomás Tocino almost 7 years
    This did the trick. Had some HTML generated by... Microsoft Word with lots of weird unicode characters and this somehow cleaned them all.
  • Faccion
    Faccion over 6 years
    Not so sure, you may want normalize('NFKD', '1º\xa0dia') to return '1º dia' but it returns '1o dia'
  • Cho
    Cho over 4 years
ah, if the text is Korean, do not try this. All the characters end up completely garbled.
  • arizqi
    arizqi almost 4 years
    This solution changes Russian letter й to an identically looking sequence of two unicode characters. The problem here is that strings that used to be equal do not match anymore. Fix: use "NFKC" instead of "NFKD".
  • Jenya Pu
    Jenya Pu almost 4 years
    This solution worked for me: string.replace('\xa0', ' ')
  • the_economist
    the_economist almost 4 years
It doesn't catch the 'soft hyphen' (­), which is '\xad' in Latin-1. Is there any trick to also catch this symbol?
  • the_economist
    the_economist almost 4 years
    @Markus: The same applies to the German Umlaute ö, ü and ä. 'NFKC' is required instead of 'NFKD'.
  • Bill
    Bill about 3 years
    This will only remove it if it's at the beginning or end of the string.
  • Amir Shabani
    Amir Shabani almost 3 years
    This is awesome. It changes the one-letter string to the four-letter string ریال that it actually is. So it's much easier to replace when needed. You'd normalize and then replace, without having to care which one it was. normalize("NFKD", "﷼").replace("ریال", '').
  • ChewChew
    ChewChew over 2 years
    get_text(strip=True) really did a trick. Thanks m8
  • Jean Monet
    Jean Monet about 2 years
    @dbr unicodedata does not replace \xa0 with NFC (which properly retains letters with accent such as é). Example: unicodedata.normalize("NFC", "LEFT\xa0RIGHT") == "LEFT\xa0RIGHT".
  • Y4RD13
    Y4RD13 almost 2 years
    this is very specific for raw html returning unicode after cleaning with bs4 or regex. Works perfectly, but it will not remove line breaks or tabs