ASCII string as dtype for numpy array of strings in Python 3
You can use the 'S' typestr:
>>> np.array(['Hello', 'World'], dtype='S')
array([b'Hello', b'World'], dtype='|S5')
Also, in Python 2.6/2.7 bytes is an alias for str, so bytes (or np.bytes_) gives the same dtype on both versions:
>>> np.dtype((bytes, 1)) # 2.7
dtype('|S1')
>>> np.dtype((bytes, 1)) # 3.2
dtype('|S1')
And b'' literals are supported:
>>> np.array([b'Hello', b'World']) # 2.7
array(['Hello', 'World'], dtype='|S5')
>>> np.array([b'Hello', b'World']) # 3.2
array([b'Hello', b'World'], dtype='|S5')
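As a rough illustration of why the 'S' typestr addresses the memory concern, the sketch below compares the per-element size of a byte-string dtype with the unicode one on Python 3. The variable names are made up for the example, and it assumes the input contains only ASCII characters (NumPy raises an error when asked to store non-ASCII text in an 'S' array).

```python
import numpy as np

words = ['Hello', 'World']

# 'S5' stores 5 bytes per element; 'U5' stores 5 UCS-4 code points.
ascii_arr = np.array(words, dtype='S5')
unicode_arr = np.array(words, dtype='U5')

print(ascii_arr.dtype.itemsize)    # 5
print(unicode_arr.dtype.itemsize)  # 20

# On Python 3 the elements come back as bytes, so decode them
# when ordinary str objects are needed again.
decoded = [w.decode('ascii') for w in ascii_arr]
print(decoded)
```

The trade-off is that code consuming the array on Python 3 has to deal with bytes rather than str, hence the explicit decode step.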
Lev Levitsky
Updated on June 13, 2022
Comments
Lev Levitsky almost 2 years
NumPy's string dtype seems to correspond to Python's str and thus to change between Python 2.x and 3.x:
In Python 2.7:
In [1]: import numpy as np
In [2]: np.dtype((np.str_, 1)).itemsize
Out[2]: 1
In [3]: np.dtype((np.unicode_, 1)).itemsize
Out[3]: 4
In Python 3.3:
In [2]: np.dtype((np.str_, 1)).itemsize
Out[2]: 4
The version of NumPy is 1.7.0 in both cases.
I'm writing some code that I want to work on both Python versions, and I want an array of ASCII strings (4x memory overhead is not acceptable). So the questions are:
- How do I define a dtype for an ASCII string of certain length (with 1 byte per char) in Python 3?
- How do I do it in a way that also works in Python 2?
- Bonus question: Can I limit the alphabet even further, e.g. to ascii_uppercase, and save a bit or two per char?
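On the bonus question: NumPy has no built-in sub-byte string dtype as far as I know, so any saving would have to come from packing by hand. The following is only a sketch under that assumption; pack_upper and unpack_upper are hypothetical helpers written for this example, not NumPy functions. Each letter of A-Z is mapped to a 5-bit code and the bits are packed with np.packbits.

```python
import numpy as np

def pack_upper(text):
    """Pack a string of A-Z letters into 5 bits per character."""
    raw = np.frombuffer(text.encode('ascii'), dtype=np.uint8)
    codes = raw.astype(np.int16) - ord('A')
    if np.any((codes < 0) | (codes > 25)):
        raise ValueError('only ASCII uppercase letters are supported')
    # Expand each code to 8 bits (MSB first) and keep the low 5.
    bits = np.unpackbits(codes.astype(np.uint8)[:, None], axis=1)[:, 3:]
    return np.packbits(bits.ravel())

def unpack_upper(packed, length):
    """Inverse of pack_upper; the original length must be supplied."""
    bits = np.unpackbits(packed)[:length * 5].reshape(length, 5)
    weights = 1 << np.arange(4, -1, -1)  # 16, 8, 4, 2, 1
    codes = bits.dot(weights) + ord('A')
    return codes.astype(np.uint8).tobytes().decode('ascii')

packed = pack_upper('HELLO')
print(packed.nbytes)  # 4 bytes instead of 5
restored = unpack_upper(packed, 5)
print(restored)
```

The saving is at most 3 bits per character, and the packed array can no longer be indexed or compared as strings without unpacking, so this is probably only worth it for very large, rarely accessed data.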
One potential answer I see to the first question is character arrays (i.e. have an array of character arrays instead of an array of strings). It seems I can specify the item size when constructing one:
chararray(shape, itemsize=1, unicode=False, buffer=None, offset=0, strides=None, order=None)
Update: nah, the itemsize is actually the number of characters. But there's still unicode=False.
Is that the way to go?
Will it answer the last question, too?
And how do I actually use it as dtype?
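Regarding use as a dtype: a fixed-width byte-string type such as 'S5' can be passed anywhere NumPy accepts a dtype, without going through chararray. The sketch below is a minimal illustration; the field names 'name' and 'mass' and the sample values are invented for the example.

```python
import numpy as np

ascii5 = np.dtype('S5')  # 5 bytes per element, 1 byte per character

# A plain array of fixed-width byte strings.
names = np.zeros(3, dtype=ascii5)
names[0] = b'HELLO'
names[1] = b'TOOLONGVALUE'  # longer values are cut to 5 bytes

# The same typestr as a field of a structured dtype.
record = np.dtype([('name', 'S5'), ('mass', 'f8')])
table = np.zeros(2, dtype=record)
table['name'][0] = b'WORLD'

# The vectorized string helpers in np.char accept 'S' arrays.
lowered = np.char.lower(names)
print(names)
print(lowered)
```

Note that over-long values are truncated rather than rejected, so the width has to be chosen with the longest expected string in mind.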