TypeError: sequence item 1: expected a bytes-like object, str found
Solution 1
You have to choose between binary and text mode.
Either you open your file in rb mode, and then you can use re.sub(b"[^a-zA-Z]", b" ", text) (text is a bytes object).
Or you open your file in r mode, and then you can use re.sub("[^a-zA-Z]", " ", text) (text is a str object).
The second solution is the more conventional one when you are processing text.
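As a minimal sketch of the two modes: the temporary file and its contents below are made up for illustration and are not taken from the question. The same substitution works either way, as long as the pattern, the replacement, and the text all share one type.

```python
import os
import re
import tempfile

# Hypothetical sample file standing in for the real title dump.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Hello2World !Bang!")
    path = tmp.name

# Binary mode: read() returns bytes, so pattern and replacement are bytes too.
with open(path, "rb") as f:
    bytes_text = f.read()
bytes_result = re.sub(b"[^a-zA-Z]", b" ", bytes_text)

# Text mode: read() returns str, so pattern and replacement are str.
with open(path, "r", encoding="utf-8") as f:
    str_text = f.read()
str_result = re.sub("[^a-zA-Z]", " ", str_text)

os.remove(path)  # clean up the temporary file

print(bytes_result.split())
print(str_result.split())
```

The text-mode variant passes an explicit encoding; UTF-8 is an assumption here, since the question does not say how the dump is encoded.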
Solution 2
The problem is with the repl argument you supply: it isn't a bytes object.
letters_only = re.sub(b"[^a-zA-Z]", " ", b'Hello2World')
# TypeError: sequence item 1: expected a bytes-like object, str found
Instead, supply repl as a bytes instance, b" ":
letters_only = re.sub(b"[^a-zA-Z]", b" ", b'Hello2World')
print(letters_only)
b'Hello World'
Note: Don't prefix your literals with b, and don't open the file in rb mode, if you aren't looking for byte sequences.
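If the data has already been read as bytes and you want ordinary strings, one option is to decode it. This is a minimal sketch, assuming the data is UTF-8 encoded (the question does not state the encoding) and using a made-up sample value:

```python
import re

# Made-up sample; assumed to be UTF-8 encoded bytes.
raw = "Hello2World".encode("utf-8")

# Decode first, then work with str everywhere.
text = raw.decode("utf-8")
words = re.sub("[^a-zA-Z]", " ", text).lower().split()
print(words)  # ['hello', 'world']

# Alternatively, keep bytes during the substitution and decode each word afterwards.
# After the substitution only ASCII letters remain, so decoding as ASCII is safe.
byte_words = re.sub(b"[^a-zA-Z]", b" ", raw).lower().split()
decoded_words = [w.decode("ascii") for w in byte_words]
print(decoded_words)  # ['hello', 'world']
```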
Solution 3
You can't use a bytes pattern for your regex match when the replacement string isn't bytes as well.
Essentially, you can't mix bytes and str objects when doing most tasks. In your code above, you are using a bytes search pattern and a bytes text, but your replacement string is a regular str. All arguments need to be of the same type, so there are 2 possible solutions to this: make everything bytes, or make everything str.
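To illustrate, here is a small sketch (the sample value is made up) that tries the mixed call and then both consistent variants:

```python
import re

data = b"Hello2World"  # made-up sample value

# Mixed types: bytes pattern and bytes text, but a str replacement.
try:
    re.sub(b"[^a-zA-Z]", " ", data)
    mixed_failed = False
except TypeError as exc:
    mixed_failed = True
    print("TypeError:", exc)

# Option 1: everything is bytes.
all_bytes = re.sub(b"[^a-zA-Z]", b" ", data)

# Option 2: everything is str (the sample is pure ASCII, so ASCII decoding is fine).
all_str = re.sub("[^a-zA-Z]", " ", data.decode("ascii"))

print(all_bytes)  # b'Hello World'
print(all_str)    # Hello World
```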
Taking the above into account, your code could look like this (this will return regular strings, not bytes objects):
import re

with open('/Users/some/directory/title.txt', 'r') as f:
    text = f.read()
letters_only = re.sub(r"[^a-zA-Z]", " ", text)
words = letters_only.lower().split()
print(words)
Note that the code uses a special type of string for the regex: a raw string, prefixed with r. This means that Python won't process backslash escape sequences (those starting with \) in the literal, which is very useful for regexes. See the docs for more details about raw strings.
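A short sketch of the difference, using the \b escape as an example (the pattern and sample text are made up for illustration):

```python
import re

# In a normal literal, "\b" is a single backspace character (Python processes the escape).
plain = "\bcat\b"
# In a raw literal the backslash is kept, so the regex engine sees \b (word boundary).
raw = r"\bcat\b"

print(len(plain), len(raw))  # 5 7

text = "cat concat cat"
plain_matches = re.findall(plain, text)
raw_matches = re.findall(raw, text)
print(plain_matches)  # []
print(raw_matches)    # ['cat', 'cat']
```

The pattern in the code above contains no backslashes, so the r prefix makes no difference there; it is simply a safe habit for regex literals.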
Sherlock
Updated on June 12, 2022
Comments
-
Sherlock almost 2 years
I am trying to extract English titles from a wiki titles dump that's in a text file using regex in Python 3. The wiki dump contains titles in other languages also and some symbols. Below is my code:
with open('/Users/some/directory/title.txt', 'rb') as f:
    text = f.read()
letters_only = re.sub(b"[^a-zA-Z]", " ", text)
words = letters_only.lower().split()
print(words)
But I am getting an error:
TypeError: sequence item 1: expected a bytes-like object, str found
at the line:
letters_only = re.sub(b"[^a-zA-Z]", " ", text)
But I am using b'' to make the output a bytes type. Below is a sample of the text file:
Destroy-Oh-Boy!! !!Que_Corra_La_Voz!! !!_(chess) !!_(disambiguation) !'O!Kung !'O!Kung_language !'O-!khung_language !337$P34K != !? !?! !?Revolution!? !?_(chess) !A_Luchar! !Action_Pact! !Action_pact! !Adios_Amigos! !Alabadle! !Alarma! !Alarma!_(album) !Alarma!_(disambiguation) !Alarma!_(magazine) !Alarma!_Records !Alarma!_magazine !Alfaro_Vive,_Carajo! !All-Time_Quarterback! !All-Time_Quarterback!_(EP) !All-Time_Quarterback!_(album) !Alla_tu! !Amigos! !Amigos!_(Arrested_Development_episode) !Arriba!_La_Pachanga !Ask_a_Mexican! !Atame! !Ay,_Carmela!_(film) !Ay,_caramba! !BANG! !Bang! !Bang!_TV !Basta_Ya! !Bastardos! !Bastardos!_(album) !Bastardos_en_Vivo! !Bienvenido,_Mr._Marshall! !Ciauetistico! !Ciautistico! !DOCTYPE !Dame!_!Dame!_!Dame! !Decapitacion! !Dos! !Explora!_Science_Center_and_Children's_Museum !F !Forward,_Russia! !Forward_Russia! !Ga!ne_language !Ga!nge_language !Gã!ne !Gã!ne_language !Gã!nge_language !HERO !Happy_Birthday_Guadaloupe! !Happy_Birthday_Guadalupe! !Hello_Friends
I have searched online but could not find a solution. Any help will be appreciated.
-
imant over 7 years
try re.sub("[^a-zA-Z]", " ", text) instead
-
Sherlock over 7 years
@imant I tried this also, but I am getting the error below: TypeError: cannot use a string pattern on a bytes-like object
-
Jean-François Fabre over 7 years
Actually you CAN do it, see Jim's answer. You should know it, I have known it for at least ... 5 minutes :)
-
Jean-François Fabre over 7 years
Very nice, I did not know it could be done on bytes. However, I'm not sure it's the way to go here. Better to go text-only and drop the bytes. Well, maybe it avoids encoding problems.
-
Dartmouth over 7 years
@Jean-FrançoisFabre Hmmmmmmm... I do too now ;)
-
Sherlock over 7 years
That works, now there is no error. But I am getting a "b" prefix on every extracted word, like this: [b'you', b'and', b'then', b'some']. I think, according to you, it should not be there.
-
Dimitris Fasarakis Hilliard over 7 years
@Jean-FrançoisFabre you were right ;-). Sherlock, just open the file without specifying b, as @Jean suggests in his answer. b prefixed to the mode when opening files results in them being read as bytes objects; if that isn't what you need, drop it :-)
-
Jean-François Fabre over 7 years
Let me say I'm pleased with the way things turned out: Jim is the most knowledgeable of us all, he knew about the ability to use regexes for bytes, while we mere mortals just wanted to use a text file and knew zip about that! So everyone learned something and no one got bashed (I almost deleted my post at some point).