How to remove \xa0 from string in Python?


Solution 1

\xa0 is a non-breaking space in Latin-1 (ISO 8859-1), also chr(160). You should replace it with a space:

string = string.replace(u'\xa0', u' ')

When you call .encode('utf-8'), the Unicode string is encoded to UTF-8, which represents every code point with 1 to 4 bytes. In this case, \xa0 is represented by the two bytes \xc2\xa0.
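Both points can be checked directly in Python 3 (a small sketch; the string is illustrative):

```python
# U+00A0 is a single code point; in UTF-8 it becomes the two bytes \xc2\xa0.
s = "Dear Parent,\xa0Thanks"

cleaned = s.replace("\xa0", " ")   # replace the NBSP with a regular space
encoded = "\xa0".encode("utf-8")   # the same code point as UTF-8 bytes

print(cleaned)   # Dear Parent, Thanks
print(encoded)   # b'\xc2\xa0'
```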

Read up on http://docs.python.org/howto/unicode.html.

Please note: this answer is from 2012. Python has moved on, and you should be able to use unicodedata.normalize now.

Solution 2

There are many useful functions in Python's unicodedata library. One of them is the .normalize() function.

Try:

new_str = unicodedata.normalize("NFKD", unicode_str)

Replace NFKD with any of the other normalization forms (NFC, NFD, NFKC) if you don't get the results you're after.
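For example (a small Python 3 sketch): NFKD maps U+00A0 to a regular space, but compatibility decomposition also rewrites other characters, which may be broader than you want (see the comments below about accented letters and Korean text):

```python
import unicodedata

text = "Dear Parent,\xa0Thanks"
print(unicodedata.normalize("NFKD", text))   # Dear Parent, Thanks

# NFKD affects more than just \xa0 -- superscripts are folded too, for example:
print(unicodedata.normalize("NFKD", "x²"))   # x2
```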

Solution 3

After trying several methods, this is how I did it, to summarize. Below are two ways of avoiding/removing \xa0 characters from a parsed HTML string.

Assume we have the following raw HTML:

raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'

So let's try to clean this HTML string:

from bs4 import BeautifulSoup
raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
text_string = BeautifulSoup(raw_html, "lxml").text
print(repr(text_string))
# 'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks'

The above code leaves \xa0 characters in the string. There are two ways to remove them properly.

Method #1 (recommended): the first option is BeautifulSoup's get_text method with the strip argument set to True. Our code becomes:

clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)
print(clean_text)
# Dear Parent,This is a test message,kindly ignore it.Thanks

Method #2: the other option is to use Python's unicodedata library:

import unicodedata
from bs4 import BeautifulSoup

text_string = BeautifulSoup(raw_html, "lxml").text
clean_text = unicodedata.normalize("NFKD", text_string)
print(clean_text)
# Dear Parent, This is a test message, kindly ignore it. Thanks

I have also detailed these methods in a blog post which you may want to refer to.

Solution 4

Try using .strip() at the end of your line: line.strip() worked well for me.
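In Python 3, str.strip() with no arguments removes leading and trailing whitespace, and '\xa0' counts as whitespace; note that it leaves \xa0 between other characters untouched (a minimal sketch):

```python
line = "\xa0hello world\xa0"
print(repr(line.strip()))       # 'hello world'

# Interior non-breaking spaces are left alone:
print(repr("a\xa0b".strip()))   # 'a\xa0b'
```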

Solution 5

Try this:

string.replace('\\xa0', ' ')
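Note the double backslash: '\\xa0' is the four-character literal sequence backslash-x-a-0, not the single non-breaking-space character '\xa0'. This version only helps if your text actually contains the literal escape (e.g. leaked from repr() output); a sketch of the difference:

```python
# '\\xa0' is 4 characters; '\xa0' is the single U+00A0 code point.
assert len("\\xa0") == 4 and len("\xa0") == 1

# Replaces a literal backslash escape, if that's what the text contains:
print("Dear\\xa0Parent".replace("\\xa0", " "))   # Dear Parent

# ...but leaves a real U+00A0 untouched:
print("Dear\xa0Parent".replace("\\xa0", " ") == "Dear\xa0Parent")  # True
```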
Author: zhuyxn

Updated on January 31, 2021

Comments

  • zhuyxn
    zhuyxn about 3 years

    I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but it seems like I'm being left with a lot of \xa0 Unicode representing spaces. Is there an efficient way to remove all of them in Python 2.7, and change them into spaces? I guess the more generalized question would be, is there a way to remove Unicode formatting?

    I tried using: line = line.replace(u'\xa0',' '), as suggested by another thread, but that changed the \xa0's to u's, so now I have "u"s everywhere instead. ):

    EDIT: The problem seems to be resolved by str.replace(u'\xa0', ' ').encode('utf-8'), but just doing .encode('utf-8') without replace() seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?

    • zhuyxn
      zhuyxn almost 12 years
      tried that already, 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
    • jpaugh
      jpaugh almost 12 years
      embrace Unicode. Use u''s instead of ''s. :-)
    • zhuyxn
      zhuyxn almost 12 years
      tried using str.replace(u'\xa0', ' ') but got "u"s everywhere instead of \xa0s :/
    • pepr
      pepr almost 12 years
      If the string is the unicode one, you have to use the u' ' replacement, not the ' '. Is the original string the unicode one?
  • Martijn Pieters
    Martijn Pieters over 11 years
You are now removing anything that isn't an ASCII character; you are probably masking your actual problem. Using 'ignore' is like forcing the shift stick even though you don't understand how the clutch works.
  • dbr
    dbr over 10 years
    @MartijnPieters The linked unicode tutorial is good, but you are completely correct - str.encode(..., 'ignore') is the Unicode-handling equivalent of try: ... except: .... While it might hide the error message, it rarely solves the problem.
  • dbr
    dbr over 10 years
    I don't know a huge amount about Unicode and character encodings.. but it seems like unicodedata.normalize would be more appropriate than str.replace
  • Admin
    Admin over 9 years
    Yours is workable advice for strings, but note that all references to this string will also need to be replaced. For example, if you have a program that opens files, and one of the files has a non-breaking space in its name, you will need to rename that file in addition to doing this replacement.
  • andilabs
    andilabs over 9 years
    for some purposes like dealing with EMAIL or URLS it seems perfect to use .decode('ascii', 'ignore')
  • jfs
    jfs about 9 years
    U+00a0 is a non-breakable space Unicode character that can be encoded as b'\xa0' byte in latin1 encoding, as two bytes b'\xc2\xa0' in utf-8 encoding. It can be represented as &nbsp; in html.
  • jfs
    jfs about 9 years
    @RyanMartin: this replaces four bytes: len(b'\\xa0') == 4 but len(b'\xa0') == 1. If possible; you should fix upstream that generates these escapes.
  • jfs
    jfs about 9 years
    samwize's answer didn't work for you because it works on Unicode strings. line.decode() in your answer suggests that your input is a bytestring (you should not call .decode() on a Unicode string (to enforce it, the method is removed in Python 3). I don't understand how it is possible to see the tutorial that you've linked in your answer and miss the difference between bytes and Unicode (do not mix them).
  • jfs
    jfs about 9 years
    this works if text is a bytestring that represents a text encoded using utf-8. If you are working with text; decode it to Unicode first (.decode('utf-8')) and encode it to a bytestring only at the very end (if API does not support Unicode directly e.g., socket). All intermediate operations on the text should be performed on Unicode.
  • jfs
    jfs about 9 years
    strip=True works only if &nbsp; is at the beginning or end of each bit of text. It won't remove the space if it is inbetween other characters in the text.
  • jfs
    jfs about 9 years
    0xc2a0 is ambiguous (byte order). Use b'\xc2\xa0' bytes literal instead.
  • jds
    jds almost 9 years
    When I try this, I get UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 397: ordinal not in range(128).
  • Mushroom Man
    Mushroom Man over 7 years
I tried this code on a list of strings and it didn't do anything; the \xa0 character remained. If I re-encoded my text file to UTF-8, the character would appear as an uppercase A with a caret on its head, and when I encoded it as Unicode the Python interpreter crashed.
  • José Tomás Tocino
    José Tomás Tocino almost 7 years
    This did the trick. Had some HTML generated by... Microsoft Word with lots of weird unicode characters and this somehow cleaned them all.
  • Faccion
    Faccion over 6 years
    Not so sure, you may want normalize('NFKD', '1º\xa0dia') to return '1º dia' but it returns '1o dia'
  • Cho
    Cho over 4 years
ah, if the text is Korean, do not try this. All the characters end up completely garbled.
  • arizqi
    arizqi almost 4 years
    This solution changes Russian letter й to an identically looking sequence of two unicode characters. The problem here is that strings that used to be equal do not match anymore. Fix: use "NFKC" instead of "NFKD".
  • Jenya Pu
    Jenya Pu almost 4 years
    This solution worked for me: string.replace('\xa0', ' ')
  • the_economist
    the_economist almost 4 years
It doesn't catch the 'soft hyphen' (­), which is '\xad' in Latin-1. Is there any trick to also catch this symbol?
  • the_economist
    the_economist almost 4 years
    @Markus: The same applies to the German Umlaute ö, ü and ä. 'NFKC' is required instead of 'NFKD'.
  • Bill
    Bill about 3 years
    This will only remove it if it's at the beginning or end of the string.
  • Amir Shabani
    Amir Shabani almost 3 years
    This is awesome. It changes the one-letter string to the four-letter string ریال that it actually is. So it's much easier to replace when needed. You'd normalize and then replace, without having to care which one it was. normalize("NFKD", "﷼").replace("ریال", '').
  • ChewChew
    ChewChew over 2 years
    get_text(strip=True) really did a trick. Thanks m8
  • Jean Monet
    Jean Monet about 2 years
    @dbr unicodedata does not replace \xa0 with NFC (which properly retains letters with accent such as é). Example: unicodedata.normalize("NFC", "LEFT\xa0RIGHT") == "LEFT\xa0RIGHT".
  • Y4RD13
    Y4RD13 almost 2 years
    this is very specific for raw html returning unicode after cleaning with bs4 or regex. Works perfectly, but it will not remove line breaks or tabs