Programmatically add column names to numpy ndarray


Solution 1

The problem is that you are thinking in terms of spreadsheet-like arrays, whereas NumPy uses different concepts.

Here is what you must know about NumPy:

  1. NumPy arrays only contain elements of a single type.
  2. If you need spreadsheet-like "columns", this type must be some tuple-like type. Such arrays are called Structured Arrays, because their elements are structures (i.e. tuples).

In your case, NumPy would thus take your 2-dimensional regular array and produce a one-dimensional array whose type is a 108-element tuple (the spreadsheet array that you are thinking of is 2-dimensional).
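As a minimal sketch of the difference between the two layouts (the column names and values below are made up for illustration, not taken from the question's file), the same rows can be stored either as a regular 2D array or as a 1D structured array:

```python
import numpy as np

# Hypothetical rows; each tuple is one "spreadsheet row".
rows = [(30.0, 12.0, 40000.0), (45.0, 16.0, 65000.0)]

# Without a dtype, NumPy builds a regular 2D array of floats.
regular = np.array(rows)

# With a structured dtype, each tuple becomes ONE element with named fields.
structured = np.array(
    rows,
    dtype=[('AGE', 'float64'), ('EDUC', 'float64'), ('INCOME', 'float64')],
)

print(regular.shape)              # (2, 3)
print(structured.shape)           # (2,)
print(structured.dtype.names)     # ('AGE', 'EDUC', 'INCOME')
print(structured.dtype.itemsize)  # 24 (three 8-byte floats per element)
print(structured['EDUC'])         # the EDUC "column" as a 1D float array
```

Every element of the structured array has the same size, which is what lets NumPy keep its fast, uniform memory layout.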

These choices were probably made for efficiency reasons: all the elements of an array have the same type and therefore the same size, so they can be accessed very simply and quickly at a low level.

Now, as user545424 showed, there is a simple NumPy answer to what you want to do (genfromtxt() accepts a names argument with column names).

If you want to convert your array from a regular NumPy ndarray to a structured array, you can do:

data.view(dtype=[(n, 'float64') for n in csv_names]).reshape(len(data))

(You were close: you used astype() instead of view().)
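Here is a small, self-contained sketch of the view() approach. It assumes the array is C-contiguous and already float64; the names in csv_names and the numbers are hypothetical stand-ins for the real file:

```python
import numpy as np

# Hypothetical stand-ins for the question's header and data.
csv_names = ['AGE', 'EDUC', 'INCOME']
data = np.array([[30.0, 12.0, 40000.0],
                 [45.0, 16.0, 65000.0]])

# view() reinterprets each row of three float64 values as one structured
# element; reshape() drops the leftover axis of length 1.
named = data.view(dtype=[(n, 'float64') for n in csv_names]).reshape(len(data))

print(named.shape)      # (2,)
print(named['EDUC'])    # 1D array holding the EDUC column

# Both arrays share the same memory, so the 2D array stays usable
# for linear algebra while the view gives access by name.
gram = data.T @ data            # regular matrix product on the 2D array
named['EDUC'][0] = 13.0         # writing through the view...
print(data[0, 1])               # ...changes the original array: 13.0
```

Because view() does not copy anything, it costs almost nothing, but it only works when the itemsize of the new dtype matches the bytes in one row.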

You can also check the answers to several related Stack Overflow questions, including "Converting a 2D numpy array to a structured array" and "how to convert regular numpy array to record array?".

Solution 2

Unfortunately, I don't know what is going on when you try to add the field names, but I do know that you can build the array you want directly from the file via

data = np.genfromtxt(csv_file, delimiter=',', names=True)
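To try this without a file on disk, the sketch below feeds genfromtxt() an in-memory buffer. The CSV content is invented for illustration, and the header is left unquoted on purpose, since this example does not rely on how quoted names are handled:

```python
import io
import numpy as np

# Hypothetical CSV content standing in for csv_file.
csv_text = io.StringIO("AGE,EDUC\n30,12\n45,16\n")

# names=True takes the field names from the first line of the input.
data = np.genfromtxt(csv_text, delimiter=',', names=True)

print(data.dtype.names)  # ('AGE', 'EDUC')
print(data.shape)        # (2,)
print(data['EDUC'])      # 1D float array with the EDUC column
```

The result is a 1D structured array, one element per data row, not a 2D array.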

EDIT:

It seems like adding field names only works when the input is a list of tuples:

data = np.array(list(map(tuple, data)), dtype=[(n, 'float64') for n in csv_names])
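A short sketch of the list-of-tuples route, written for Python 3 (where map() returns an iterator that has to be turned into a list first). The names and values are hypothetical:

```python
import numpy as np

# Hypothetical stand-ins for the question's header and data.
csv_names = ['AGE', 'EDUC']
data = np.array([[30.0, 12.0],
                 [45.0, 16.0]])

fields = [(n, 'float64') for n in csv_names]

# Each row becomes a tuple, and each tuple becomes one structured element.
copied = np.array(list(map(tuple, data)), dtype=fields)

print(copied.shape)                   # (2,)
print(copied['EDUC'])                 # 1D float array with the EDUC column
print(np.shares_memory(data, copied)) # False: this builds a new array
```

Unlike view(), this approach copies the data, so later changes to one array are not reflected in the other.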
Author by

Abe

Updated on September 14, 2022

Comments

  • Abe
    Abe over 1 year

    I'm trying to add column names to a numpy ndarray, then select columns by their names. But it doesn't work. I can't tell if the problem occurs when I add the names, or later when I try to call them.

    Here's my code.

    data = np.genfromtxt(csv_file, delimiter=',', dtype=np.float, skip_header=1)
    
    #Add headers
    csv_names = [ s.strip('"') for s in file(csv_file,'r').readline().strip().split(',')]
    data = data.astype(np.dtype( [(n, 'float64') for n in csv_names] ))
    

    Dimension-based diagnostics match what I expect:

    print len(csv_names)
    >> 108
    print data.shape
    >> (1652, 108)
    

    "print data.dtype.names" also returns the expected output.

    But when I start calling columns by their field names, screwy things happen. The "column" is still an array with 108 columns...

    print data["EDUC"].shape
    >> (1652, 108)
    

    ... and it appears to contain more missing values than there are rows in the data set.

    print np.sum(np.isnan(data["EDUC"]))
    >> 27976
    

    Any idea what's going wrong here? Adding headers should be a trivial operation, but I've been fighting this bug for hours. Help!

  • Abe
    Abe almost 12 years
    So is it the case that ndarrays can be referenced by field names if they are cast as tuples OR referenced by field id when cast as arrays---but never both? That seems to be the way it works, but I don't see anything like that in the documentation.
  • user545424
    user545424 almost 12 years
    I'm starting to wonder if this is a bug. It's very strange behavior to have the array constructor act differently based on the type of the nested structure you pass in.
  • Abe
    Abe almost 12 years
    Thanks -- this helps clear things up conceptually. But I still have some questions about this particular case. Here, all of my columns are floats, and I'm going to be doing a lot of matrix multiplication, so I want to keep the 2d-array structure -- no need for structured array. All I want to do is add field names. Is that possible?
  • Abe
    Abe almost 12 years
    NB: genfromtxt imports the csv in numpy's structured tuple format. I tried everything I could think of to import field names in array format, and nothing worked.
  • Bruno Feroleto
    Bruno Feroleto almost 12 years
    @Abe: You can still perform matrix multiplications: the view() is simply another way to look at the same data. So, you can work with both the original data array and the view()ed array at the same time (the first array is 2D, the second is 1D and structured).
  • Bruno Feroleto
    Bruno Feroleto almost 12 years
    @Abe: About your 2nd question: you cannot have "field names in (2D) array format". This concept is not valid in NumPy (this is a spreadsheet concept). You want either a non-structured/named-columns 2D array (your data array), or a 1D structured/named-columns version of it (the result of view() in my answer). I hope this will help clear things up. :)
  • Bruno Feroleto
    Bruno Feroleto almost 12 years
    @user545424: You can understand this behavior if you know the principles on which NumPy is based (you can for instance check my answer). In a nutshell: tuple() is a kind of "fundamental type" (like floats), for NumPy (so you get a kind of structured array, when you pass tuples), whereas passing lists or arrays as input means "add another dimension" to the array (you get an array of numbers, typically).
  • Bruno Feroleto
    Bruno Feroleto almost 12 years
    @Abe: Technically, and without wanting to make things more complicated than they are, note that you can have a 2D (or n-dimensional) structured array. However, each cell will contain a tuple. Example: arr = np.zeros((3, 5), dtype=[('x', int), ('y', float)]), with field access like arr['x'], which returns a 2D array of integers.
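
The last comment's example can be run as a quick sketch like this (the value written into one cell is arbitrary, chosen only to show the access pattern):

```python
import numpy as np

# A 3x5 structured array: every cell holds an (x, y) pair.
arr = np.zeros((3, 5), dtype=[('x', int), ('y', float)])

# Assign a whole cell with a tuple.
arr[1, 2] = (7, 0.5)

print(arr.shape)        # (3, 5)
print(arr['x'].shape)   # (3, 5): a 2D array of integers
print(arr['x'][1, 2])   # 7
print(arr['y'][1, 2])   # 0.5
```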