Programmatically add column names to numpy ndarray
Solution 1
The problem is that you are thinking in terms of spreadsheet-like arrays, whereas NumPy uses different concepts.
Here is what you must know about NumPy:
- NumPy arrays only contain elements of a single type.
- If you need spreadsheet-like "columns", this type must be some tuple-like type. Such arrays are called Structured Arrays, because their elements are structures (i.e. tuples).
In your case, NumPy would thus take your 2-dimensional regular array and produce a one-dimensional array whose type is a 108-element tuple (the spreadsheet array that you are thinking of is 2-dimensional).
These choices were probably made for efficiency reasons: all the elements of an array have the same type and therefore the same size, so they can be accessed very simply and quickly at a low level.
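To make the distinction concrete, here is a minimal sketch comparing a regular 2D array with a structured array. The field names "age" and "educ" and the values are made up for illustration; they are not from the question's data.

```python
import numpy as np

# A regular 2D array: every element is a single float64.
regular = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# A structured array: every element is one record made of two named
# float64 fields. The field names here are hypothetical.
structured = np.array(
    [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)],
    dtype=[("age", "float64"), ("educ", "float64")],
)

print(regular.shape)              # (3, 2)
print(structured.shape)           # (3,)
print(structured.dtype.names)     # ('age', 'educ')
print(structured.dtype.itemsize)  # 16 (two 8-byte floats per element)
print(structured["educ"])         # [2. 4. 6.]
```

Both arrays hold the same six numbers; only the way NumPy groups them into elements differs.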
Now, as user545424 showed, there is a simple NumPy answer to what you want to do (genfromtxt() accepts a names argument with column names).
If you want to convert your array from a regular NumPy ndarray to a structured array, you can do:
data.view(dtype=[(n, 'float64') for n in csv_names]).reshape(len(data))
(you were close: you used astype() instead of view()).
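As a rough illustration of the view() approach on a small array, the sketch below assumes a C-contiguous float64 array and two hypothetical column names. It also shows that the structured view and the original 2D array share the same memory, so the 2D array stays available for matrix operations.

```python
import numpy as np

# Small stand-in for the real data: 3 rows, 2 float64 columns.
data = np.arange(6, dtype="float64").reshape(3, 2)
csv_names = ["AGE", "EDUC"]  # hypothetical column names

# Reinterpret each row of two floats as one record with two named fields.
named = data.view(dtype=[(n, "float64") for n in csv_names]).reshape(len(data))

print(named.shape)      # (3,)
print(named["EDUC"])    # [1. 3. 5.]

# No copy was made: both arrays look at the same buffer.
print(np.shares_memory(data, named))  # True
data[0, 1] = 99.0
print(named["EDUC"][0])  # 99.0

# The original 2D array still supports matrix multiplication.
gram = data.T @ data
print(gram.shape)        # (2, 2)
```

The view only works this simply when every column has the same type as the original array, which is the case in the question (all float64).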
You can also check the answers to quite a few Stack Overflow questions, including "Converting a 2D numpy array to a structured array" and "how to convert regular numpy array to record array?".
Solution 2
Unfortunately, I don't know what is going on when you try to add the field names, but I do know that you can build the array you want directly from the file via
data = np.genfromtxt(csv_file, delimiter=',', names=True)
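For a quick check of what names=True produces, here is a small sketch that reads CSV text from an in-memory buffer instead of a file. The header and values are invented for the example; a real csv_file path would be passed the same way.

```python
import io
import numpy as np

# Hypothetical CSV content: a header line followed by two data rows.
csv_text = "AGE,EDUC\n30,12\n45,16\n"

# names=True tells genfromtxt to take the field names from the first line.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True)

print(data.shape)        # (2,)
print(data.dtype.names)  # ('AGE', 'EDUC')
print(data["EDUC"])      # [12. 16.]
```

Note that the result is a 1D structured array, not a 2D array with labels attached.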
EDIT:
It seems like adding field names only works when the input is a list of tuples:
data = np.array(list(map(tuple, data)), dtype=[(n, 'float64') for n in csv_names])
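Here is a small sketch of the list-of-tuples construction on made-up data. As a possible alternative, it also uses unstructured_to_structured from numpy.lib.recfunctions, which I believe is available from NumPy 1.16 onward; the field names are hypothetical.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

rows = np.array([[1.0, 2.0], [3.0, 4.0]])
fields = [("AGE", "float64"), ("EDUC", "float64")]  # hypothetical names

# Each row becomes a tuple, so NumPy reads it as one record.
from_tuples = np.array([tuple(row) for row in rows], dtype=fields)
print(from_tuples.shape)    # (2,)
print(from_tuples["EDUC"])  # [2. 4.]

# Helper that does the same conversion; the last axis becomes the fields.
from_helper = rfn.unstructured_to_structured(rows, np.dtype(fields))
print(from_helper.shape)    # (2,)
print(from_helper["AGE"])   # [1. 3.]
```

Unlike view(), both of these build a new array rather than reinterpreting the existing buffer.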
Abe
Updated on September 14, 2022
Comments
-
Abe over 1 year
I'm trying to add column names to a numpy ndarray, then select columns by their names. But it doesn't work. I can't tell if the problem occurs when I add the names, or later when I try to call them.
Here's my code.
data = np.genfromtxt(csv_file, delimiter=',', dtype=np.float, skip_header=1)
#Add headers
csv_names = [ s.strip('"') for s in file(csv_file,'r').readline().strip().split(',')]
data = data.astype(np.dtype( [(n, 'float64') for n in csv_names] ))
Dimension-based diagnostics match what I expect:
print len(csv_names) >> 108
print data.shape >> (1652, 108)
"print data.dtype.names" also returns the expected output.
But when I start calling columns by their field names, screwy things happen. The "column" is still an array with 108 columns...
print data["EDUC"].shape >> (1652, 108)
... and it appears to contain more missing values than there are rows in the data set.
print np.sum(np.isnan(data["EDUC"])) >> 27976
Any idea what's going wrong here? Adding headers should be a trivial operation, but I've been fighting this bug for hours. Help!
-
Abe almost 12 years
So is it the case that ndarrays can be referenced by field names if they are cast as tuples OR referenced by field id when cast as arrays, but never both? That seems to be the way it works, but I don't see anything like that in the documentation.
-
user545424 almost 12 years
I'm starting to wonder if this is a bug. It's very strange behavior to have the array constructor act differently based on the type of the nested structure you pass in.
-
Abe almost 12 years
Thanks -- this helps clear things up conceptually. But I still have some questions about this particular case. Here, all of my columns are floats, and I'm going to be doing a lot of matrix multiplication, so I want to keep the 2d-array structure -- no need for structured array. All I want to do is add field names. Is that possible?
-
Abe almost 12 years
NB: genfromtxt imports the csv in numpy's structured tuple format. I tried everything I could think of to import field names in array format, and nothing worked.
-
Bruno Feroleto almost 12 years
@Abe: You can still perform matrix multiplications: the view() is simply another way to look at the same data. So, you can work with both the original data array and the view()ed array at the same time (the first array is 2D, the second is 1D and structured).
-
Bruno Feroleto almost 12 years
@Abe: About your 2nd question: you cannot have "field names in (2D) array format". This concept is not valid in NumPy (this is a spreadsheet concept). You want either a non-structured 2D array without named columns (your data array), or a 1D structured version of it with named columns (the result of view() in my answer). I hope this will help clear things up. :)
-
Bruno Feroleto almost 12 years
@user545424: You can understand this behavior if you know the principles on which NumPy is based (you can for instance check my answer). In a nutshell: for NumPy, a tuple is a kind of "fundamental type" (like a float), so you get a kind of structured array when you pass tuples, whereas passing lists or arrays as input means "add another dimension" to the array (you typically get an array of numbers).
-
Bruno Feroleto almost 12 years
@Abe: I don't want to make things more complicated than they are, but note that, technically, you can have a 2D (or n-dimensional) structured array. However, each cell will contain a tuple. Example: arr = np.zeros((3, 5), dtype=[('x', int), ('y', float)]), with structure access like arr['x'], which returns a 2D array of integers.
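A short sketch of that 2D structured array, using the same shape and field names as in the comment above:

```python
import numpy as np

# A 3x5 grid where every cell is a record with an int field and a float field.
arr = np.zeros((3, 5), dtype=[("x", int), ("y", float)])

print(arr.shape)            # (3, 5)
print(arr["x"].shape)       # (3, 5)
print(arr["x"].dtype.kind)  # i (an integer type)
print(arr["y"].dtype)       # float64

# Field access returns a view, so writing through it updates arr.
arr["y"][0, 0] = 2.5
print(arr[0, 0]["y"])       # 2.5
```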