replace special characters in a string python

python string list replace urllib

183,392

Solution 1

str.replace is the wrong function for what you want to do (apart from it being used incorrectly). You want to replace any character of a set with a space, not the whole set with a single space (the latter is what replace does). You can use translate like this:

removeSpecialChars = z.translate({ord(c): " " for c in r"!@#$%^&*()[]{};:,./<>?\|`~-=_+"})

This creates a mapping which maps every character in your list of special characters to a space, then calls translate() on the string, replacing every single character in the set of special characters with a space.
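As a rough illustration of the mapping described above, here is a minimal sketch. The sample string is an assumption standing in for the downloaded HTML held in z, and special_chars, table, cleaned and words are names chosen only for this example.

```python
# Sample input is an assumption standing in for the downloaded HTML string z.
z = "<p>Hello, world! (test)</p>"

# Raw string so the backslash in the character list is taken literally.
special_chars = r"!@#$%^&*()[]{};:,./<>?\|`~-=_+"

# Map each special character's code point to a single space.
table = {ord(c): " " for c in special_chars}

cleaned = z.translate(table)   # same length as z, special characters blanked out
words = cleaned.split()        # split() collapses the runs of spaces
print(words)                   # ['p', 'Hello', 'world', 'test', 'p']
```

Because every special character becomes exactly one space, split() can then be used to build the word list the question asks for.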

Solution 2

One way is to use re.sub, which is my preferred approach.

import re
my_str = "hey th~!ere"
my_new_string = re.sub(r'[^a-zA-Z0-9 \n\.]', '', my_str)
print(my_new_string)

Output:

hey there

Another way is to use re.escape:

import string
import re

my_str = "hey th~!ere"

chars = re.escape(string.punctuation)
print(re.sub(r'[' + chars + ']', '', my_str))

Output:

hey there

A small style tip: per PEP 8, variable names should be lowercase with underscores, so remove_special_chars rather than removeSpecialChars.

Also, the space inside the character class is what keeps the spaces. If you want to remove the spaces as well, change [^a-zA-Z0-9 \n\.] to [^a-zA-Z0-9\n\.]
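To see what the space inside the negated class does, here is a small comparison. The sample string and the names keep_spaces and drop_spaces are assumptions made for this sketch only.

```python
import re

my_str = "hey th~!ere, you"

# Space is listed inside the negated class, so spaces survive.
keep_spaces = re.sub(r"[^a-zA-Z0-9 \n.]", "", my_str)

# Space is not listed, so it is removed along with the other characters.
drop_spaces = re.sub(r"[^a-zA-Z0-9\n.]", "", my_str)

print(keep_spaces)  # hey there you
print(drop_spaces)  # heythereyou
```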

Solution 3

You need to call replace on z and not on str, since you want to replace characters located in the string variable z

removeSpecialChars = z.replace("!@#$%^&*()[]{};:,./<>?\|`~-=_+", " ")

But this will still not work, because replace looks for the whole argument as one substring. You will most likely need the regular expression module re and its sub function:

import re
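removeSpecialChars = re.sub(r"[!@#$%^&*()\[\]{};:,./<>?\\|`~\-=_+]", " ", z)

Don't forget the enclosing [], which indicates that this is a set of characters to be replaced. Inside the set, the literal [, ], \ and - must be escaped, otherwise the set ends early or is read as a range.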
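Inside a character class, [, ], \ and - have a special meaning, so a version of the pattern with those four escaped might look like the sketch below. The sample string is hypothetical and only stands in for the HTML held in z; pattern and cleaned are names chosen for this example.

```python
import re

# Hypothetical sample; in the question z holds the downloaded HTML.
z = "a[b]c{d}e\\f|g-h_i+j"

# Raw string; [, ], \ and - are escaped so they are treated as literal set members.
pattern = r"[!@#$%^&*()\[\]{};:,./<>?\\|`~\-=_+]"

cleaned = re.sub(pattern, " ", z)
print(cleaned.split())  # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
```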

Solution 4

replace operates on a specific string, so you need to call it like this:

removeSpecialChars = z.replace("!@#$%^&*()[]{};:,./<>?\|`~-=_+", " ")

But this is probably not what you need, since it will look for a single string containing all those characters in the same order. You can do it with a regexp, as Danny Michaud pointed out.
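A quick way to convince yourself of this is the following sketch. The sample text is an assumption, and the loop at the end is just one stdlib-only workaround, not necessarily the most efficient one.

```python
special = r"!@#$%^&*()[]{};:,./<>?\|`~-=_+"
text = "Hello, world!"

# Nothing changes: the whole special string never occurs in text as one substring.
unchanged = text.replace(special, " ")
print(unchanged == text)  # True

# Replacing one character at a time does work, at the cost of one pass per character.
cleaned = text
for ch in special:
    cleaned = cleaned.replace(ch, " ")
print(cleaned.split())  # ['Hello', 'world']
```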

As a side note, you might want to look at BeautifulSoup, which is a library for parsing messy HTML-formatted text like what you usually get from scraping websites.

Solution 5

You can replace the special characters with the desired characters as follows:

specialCharacterText = "H#y #@w @re &*)?"
inCharSet = "!@#$%^&*()[]{};:,./<>?\\|`~-=_+\""
# One space per character in inCharSet; both arguments must have the same length
outCharSet = " " * len(inCharSet)
# In Python 3, maketrans is a static method of str (string.maketrans was Python 2 only)
splCharReplaceList = str.maketrans(inCharSet, outCharSet)
splCharFreeString = specialCharacterText.translate(splCharReplaceList)
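In Python 3, str.maketrans also accepts a third argument listing characters to delete outright, which may be handy if the special characters should be dropped instead of replaced with spaces. A minimal sketch of that variant, reusing the sample text from this answer (special, delete_table and result are names made up for the example):

```python
text = "H#y #@w @re &*)?"
special = "!@#$%^&*()[]{};:,./<>?\\|`~-=_+\""

# Third argument: every character listed here is mapped to None, i.e. deleted.
delete_table = str.maketrans("", "", special)

result = text.translate(delete_table)
print(repr(result))  # 'Hy w re '
```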

Author by

user2363217

Updated on July 09, 2022

Comments

user2363217 almost 2 years

I am using urllib to get a string of html from a website and need to put each word in the html document into a list.

Here is the code I have so far. I keep getting an error. I have also copied the error below.

import urllib.request

url = input("Please enter a URL: ")

z=urllib.request.urlopen(url)
z=str(z.read())
removeSpecialChars = str.replace("!@#$%^&*()[]{};:,./<>?\|`~-=_+", " ")

words = removeSpecialChars.split()

print ("Words list: ", words[0:20])

Here is the error.

Please enter a URL: http://simleyfootball.com
Traceback (most recent call last):
  File "C:\Users\jeremy.KLUG\My Documents\LiClipse Workspace\Python Project 2\Module2.py", line 7, in <module>
    removeSpecialChars = str.replace("!@#$%^&*()[]{};:,./<>?\|`~-=_+", " ")
TypeError: replace() takes at least 2 arguments (1 given)

user2363217 almost 10 years

I have to use just the libraries included in Python. Is there a regex that could accomplish what I am trying to do?

Pavel almost 10 years

It depends on whether you are working with English texts, texts that include foreign words (with accents, umlauts, etc.), digits, currency symbols, etc. There is no universal regex to "clean up stuff"; you need to be specific about what you need.

thibault ketterer almost 9 years

+1, clearly the fastest and best answer; it handles every case. translate will not do anything if given strange UTF-8 characters; re.sub with a negated class [^...] is much safer.

Vreddhi Bhat almost 7 years

Are you sure regex will perform better than translate? Might translate be using regex internally?

bergercookie about 5 years

Very well done for the use of ord! Otherwise str.translate on special characters does nothing.

vineeshvs about 5 years

How do I replace the character ` using re.sub?

Jinhua Wang over 4 years

Thanks! This answer saved my day.

AdamAL over 3 years

Note that this replaces anything IN a set of characters, while this answer replaces anything NOT IN a regex match. The latter is probably a safer approach if the goal is to make a string "safe" for a given context.

radouxju about 3 years

Very helpful answer, but on the last line, don't you mean that using [^a-zA-Z0-9\n\.] will REMOVE the spaces?