ASCII string as dtype for numpy array of strings in Python 3

python arrays string numpy python-3.x

10,029

You can use the 'S' typestr:

>>> np.array(['Hello', 'World'], dtype='S')
array([b'Hello', b'World'], 
      dtype='|S5')
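To make the memory difference concrete, here is a minimal sketch comparing the 'S' dtype with the default unicode dtype on Python 3. The variable names are illustrative only, and the explicit 'S8' length is an arbitrary choice for the example:

```python
import numpy as np

words = ['Hello', 'World']

# 'S' without a length lets NumPy size the items after the longest string.
ascii_arr = np.array(words, dtype='S')
# Without a dtype, Python 3 str input becomes a unicode array (4 bytes per char).
unicode_arr = np.array(words)

ascii_itemsize = ascii_arr.dtype.itemsize      # bytes per element, 'S5'
unicode_itemsize = unicode_arr.dtype.itemsize  # bytes per element, 'U5'

# An explicit length can also be requested, e.g. 8 bytes per element.
padded = np.array(words, dtype='S8')
```

Elements of `padded` still compare equal to `b'Hello'` and `b'World'`, since trailing null bytes are stripped when an item is read back.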

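Also, in Python 2.6/2.7 bytes is an alias for str (NumPy exposes the corresponding scalar type as np.bytes_), so the same spelling gives a 1-byte-per-character dtype on both versions: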

>>> np.dtype((bytes, 1)) # 2.7
dtype('|S1')
>>> np.dtype((bytes, 1)) # 3.2
dtype('|S1')
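A small helper built on this spelling might look as follows. This is only a sketch: `ascii_dtype` is a hypothetical name, not part of NumPy.

```python
import numpy as np

def ascii_dtype(length):
    """Return a fixed-width byte-string dtype with `length` bytes per item.

    Hypothetical helper; it only wraps np.dtype((bytes, length)).
    """
    return np.dtype((bytes, length))

dt = ascii_dtype(5)
arr = np.array([b'Hello', b'World'], dtype=dt)
```

Because `bytes` exists in both Python 2.6+ and Python 3, the helper itself does not need a version check.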

And b'' literals are supported:

>>> np.array([b'Hello', b'World']) # 2.7
array(['Hello', 'World'], 
      dtype='|S5')
>>> np.array([b'Hello', b'World']) # 3.2
array([b'Hello', b'World'], 
      dtype='|S5')
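As the outputs above show, on Python 3 the elements come back as bytes rather than str. If text is needed later, a decoding step is one option. A minimal sketch, assuming the data is pure ASCII:

```python
import numpy as np

arr = np.array([b'Hello', b'World'])    # dtype 'S5', 1 byte per character

# Decode element-wise into a unicode array; this costs 4 bytes per character,
# so it is probably best done only on the slice that needs to be text.
text = np.char.decode(arr, 'ascii')
```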

Authored by

Lev Levitsky

A graduate of Moscow Institute of Physics and Technology (B.S. and M.S. in applied physics and math). Now I focus on utilizing Python for the needs of proteomics. Most of our results are available as Python packages pyteomics and pyteomics.biolccc.

Updated on June 13, 2022

Comments

Lev Levitsky almost 2 years
NumPy's string dtype seems to correspond to Python's str and thus to change between Python 2.x and 3.x:

In Python 2.7:
```
In [1]: import numpy as np

In [2]: np.dtype((np.str_, 1)).itemsize
Out[2]: 1

In [3]: np.dtype((np.unicode_, 1)).itemsize
Out[3]: 4
```
In Python 3.3:
```
In [2]: np.dtype((np.str_, 1)).itemsize
Out[2]: 4
```
The version of NumPy is 1.7.0 in both cases.
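The same comparison can be written against np.bytes_ to see the 4x factor directly on Python 3. A sketch only; the item length of 10 and the array size of 1000 are arbitrary values chosen for illustration:

```python
import numpy as np

n_chars = 10

unicode_dt = np.dtype((np.str_, n_chars))   # unicode, 4 bytes per character
bytes_dt = np.dtype((np.bytes_, n_chars))   # byte string, 1 byte per character

# Total memory used by 1000 items of each kind.
unicode_nbytes = np.zeros(1000, dtype=unicode_dt).nbytes
bytes_nbytes = np.zeros(1000, dtype=bytes_dt).nbytes
```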

I'm writing some code that I want to work on both Python versions, and I want an array of ASCII strings (4x memory overhead is not acceptable). So the questions are:
- How do I define a dtype for an ASCII string of certain length (with 1 byte per char) in Python 3?
- How do I do it in a way that also works in Python 2?
- Bonus question: Can I limit the alphabet even further, e.g. to ascii_uppercase, and save a bit or two per char?
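On the bonus question: as far as I know, NumPy has no string dtype narrower than one byte per character, so saving bits would mean packing by hand. The following is a rough sketch under that assumption, using 5 bits per uppercase letter. `pack_upper` and `unpack_upper` are hypothetical names, and the packed result is a plain uint8 buffer, not an array of strings:

```python
import numpy as np

BITS = 5  # 2**5 = 32 codes, enough for the 26 letters A-Z
WEIGHTS = np.arange(BITS - 1, -1, -1)  # bit positions, most significant first

def pack_upper(text):
    """Pack an A-Z string into a uint8 array, 5 bits per character."""
    codes = np.frombuffer(text.encode('ascii'), dtype=np.uint8) - ord('A')
    # Expand every code into its 5 low bits.
    bits = (codes[:, None] >> WEIGHTS) & 1
    return np.packbits(bits.astype(np.uint8).ravel())

def unpack_upper(packed, n_chars):
    """Inverse of pack_upper; the original length must be supplied."""
    bits = np.unpackbits(packed)[:n_chars * BITS].reshape(n_chars, BITS)
    codes = bits.dot(1 << WEIGHTS)
    return (codes + ord('A')).astype(np.uint8).tobytes().decode('ascii')

packed = pack_upper('HELLO')
restored = unpack_upper(packed, 5)
```

Five characters take 25 bits here, which round up to 4 bytes instead of 5. The saving comes at the cost of losing direct indexing and comparison of individual strings.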

Character arrays look like a potential answer to the first question (i.e. have an array of character arrays instead of an array of strings). It seems I can specify the item size when constructing one:
```
chararray(shape, itemsize=1, unicode=False, buffer=None, offset=0,
          strides=None, order=None)
```
Update: nah, the itemsize is actually the number of characters. But there's still unicode=False.

Is that the way to go?

Will it answer the last question, too?

And how do I actually use it as dtype?
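For the "use it as dtype" part, a byte-string typestr such as 'S4' can be passed wherever NumPy accepts a dtype. A minimal sketch; the field names and all values below are made up for illustration:

```python
import numpy as np

# Preallocate three 4-byte strings, then fill them in place.
names = np.zeros(3, dtype='S4')
names[:] = [b'ALA', b'GLY', b'TRP']

# The same typestr can describe one field of a structured dtype.
record = np.dtype([('label', 'S4'), ('value', 'f8')])
table = np.array([(b'ALA', 1.0), (b'GLY', 2.0)], dtype=record)
```

Each record then takes 4 bytes for the label plus 8 for the float, with no per-character unicode overhead.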