ASCII string as dtype for numpy array of strings in Python 3

You can use the 'S' typestr:

>>> np.array(['Hello', 'World'], dtype='S')
array([b'Hello', b'World'], 
      dtype='|S5')
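
A minimal sketch of what an explicit width does, assuming Python 3 and a reasonably recent NumPy. The width 3 and the helper name fits_in_ascii are chosen only for illustration and do not come from the answer. Values longer than the declared width are silently truncated, and text that cannot be encoded as ASCII is rejected:

```python
import numpy as np

# An explicit width ('S3') truncates longer values without warning.
short = np.array(['Hello', 'World'], dtype='S3')
print(short)            # each item keeps only its first 3 bytes
print(short.itemsize)   # bytes per element


def fits_in_ascii(text):
    """Return True if NumPy accepts `text` for an 'S' array (hypothetical helper)."""
    try:
        np.array([text], dtype='S')
    except UnicodeError:
        # Non-ASCII characters cannot be stored in a bytes dtype.
        return False
    return True


print(fits_in_ascii('Hello'))
print(fits_in_ascii('na\u00efve'))
```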

Also, in Python 2.6/2.7 bytes is an alias for str, so passing bytes (or np.bytes_) gives the same dtype on both versions:

>>> np.dtype((bytes, 1)) # 2.7
dtype('|S1')
>>> np.dtype((bytes, 1)) # 3.2
dtype('|S1')
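
As a sketch of how such a dtype might be used to allocate a fixed-width array (the width of 4 and the sample values are made up for illustration):

```python
import numpy as np

width = 4                          # hypothetical fixed string length
dt = np.dtype((bytes, width))      # same as np.dtype('S4')

table = np.zeros(3, dtype=dt)      # three empty 4-byte slots
table[0] = b'ACGT'
table[1] = b'TT'                   # shorter values are padded with null bytes

print(dt == np.dtype('S4'))
print(table)
print(table.nbytes)                # 3 items * 4 bytes each
```

Shorter values are padded with null bytes internally, and the padding is stripped again when an element is read back.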

And b'' literals are supported:

>>> np.array([b'Hello', b'World']) # 2.7
array(['Hello', 'World'], 
      dtype='|S5')
>>> np.array([b'Hello', b'World']) # 3.2
array([b'Hello', b'World'], 
      dtype='|S5')
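
On Python 3 the elements come back as bytes, so code that needs str has to decode them. One possible way, sketched here with np.char.decode and astype (neither is mentioned in the answer itself), is:

```python
import numpy as np

raw = np.array([b'Hello', b'World'])       # compact storage, dtype 'S5'

# Decode to a unicode array only where str values are actually needed.
decoded = np.char.decode(raw, 'ascii')
also_decoded = raw.astype('U')             # equivalent for pure-ASCII data

print(raw.dtype, raw.itemsize)
print(decoded.dtype, decoded.itemsize)
print(decoded.tolist())
```

The decoded copy uses 4 bytes per character, so it is probably best kept short-lived if memory is the concern.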
Author: Lev Levitsky

A graduate of Moscow Institute of Physics and Technology (B.S. and M.S. in applied physics and math). Now I focus on utilizing Python for the needs of proteomics. Most of our results are available as Python packages pyteomics and pyteomics.biolccc.

Updated on June 13, 2022

Comments

  • Lev Levitsky, almost 2 years

    NumPy's string dtype seems to correspond to Python's str, so it changes between Python 2.x and 3.x:

    In Python 2.7:

    In [1]: import numpy as np
    
    In [2]: np.dtype((np.str_, 1)).itemsize
    Out[2]: 1
    
    In [3]: np.dtype((np.unicode_, 1)).itemsize
    Out[3]: 4
    

    In Python 3.3:

    In [2]: np.dtype((np.str_, 1)).itemsize
    Out[2]: 4
    

    The version of NumPy is 1.7.0 in both cases.
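
The 4x difference can be checked directly by comparing np.bytes_ with np.str_ on Python 3. This is only a sketch; the array size and string width below are arbitrary values picked for illustration:

```python
import numpy as np

# Bytes per character for the two string kinds on Python 3.
bytes_per_char = np.dtype((np.bytes_, 1)).itemsize
str_per_char = np.dtype((np.str_, 1)).itemsize
print(bytes_per_char, str_per_char)

# Hypothetical workload: 1000 strings of up to 10 characters each.
n_items, width = 1000, 10
ascii_buf = np.zeros(n_items, dtype=(np.bytes_, width))
unicode_buf = np.zeros(n_items, dtype=(np.str_, width))
print(ascii_buf.nbytes, unicode_buf.nbytes)
```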

    I'm writing some code that I want to work on both Python versions, and I want an array of ASCII strings (4x memory overhead is not acceptable). So the questions are:

    • How do I define a dtype for an ASCII string of certain length (with 1 byte per char) in Python 3?
    • How do I do it in a way that also works in Python 2?
    • Bonus question: Can I limit the alphabet even further, e.g. to ascii_uppercase, and save a bit or two per char?
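
For the bonus question: as far as I know, NumPy has no built-in string dtype narrower than one byte per character, so saving bits would mean packing the data by hand. The following is a rough sketch, not an established recipe. The function names pack5 and unpack5 are hypothetical, and the input is assumed to contain only the letters A-Z, which fit in 5 bits each:

```python
import numpy as np


def pack5(text):
    """Pack an A-Z string into 5 bits per character (hypothetical helper)."""
    codes = np.frombuffer(text.encode('ascii'), dtype=np.uint8) - ord('A')
    # Expand each code to 8 bits (most significant first), keep the low 5.
    bits = np.unpackbits(codes[:, None], axis=1)[:, 3:]
    return np.packbits(bits.ravel()), len(text)


def unpack5(packed, n_chars):
    """Inverse of pack5: recover the original string."""
    bits = np.unpackbits(packed)[:n_chars * 5].reshape(n_chars, 5)
    padded = np.zeros((n_chars, 8), dtype=np.uint8)
    padded[:, 3:] = bits                      # restore the 3 leading zero bits
    codes = np.packbits(padded, axis=1).ravel()
    return (codes + ord('A')).astype(np.uint8).tobytes().decode('ascii')


packed, n_chars = pack5('HELLOWORLD')
print(len(packed))                 # bytes used after packing
print(unpack5(packed, n_chars))
```

The packed form loses cheap per-element indexing and comparison, so whether the saving of 3 bits per character is worth it depends on the workload.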

    One potential answer to the first question is character arrays (i.e. an array of character arrays instead of an array of strings). It seems I can specify the item size when constructing one:

    chararray(shape, itemsize=1, unicode=False, buffer=None, offset=0,
              strides=None, order=None)
    

    Update: nah, the itemsize is actually the number of characters. But there's still unicode=False.

    Is that the way to go?

    Will it answer the last question, too?

    And how do I actually use it as dtype?