Reading CSV files in scipy/numpy in Python
Solution 1
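Check out the Python csv module: http://docs.python.org/library/csv.html
import csv

thereIsAHeader = True  # set to False if the file has no header row
fields = 16            # expected number of columns per record
header = []
records = []

# newline="" lets the csv module handle line endings itself (Python 3)
with open("myfile.csv", newline="") as f:
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    if thereIsAHeader:
        header = next(reader)
    for row, record in enumerate(reader):
        if len(record) != fields:
            # report the row index, not the record itself
            print("Skipping malformed record %i, contains %i fields (%i expected)"
                  % (row, len(record), fields))
        else:
            records.append(record)

# do numpy stuff.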
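As a rough sketch of what the "numpy stuff" step might look like, the records collected by the csv reader can be split into numeric and text columns, with only the numeric part converted to an array. The sample data, the column layout (one ID column, three numeric columns, one description column) and the names `sample`, `values` and `ids` are all assumptions made up for illustration, not part of the original answer.

```python
import csv
import io

import numpy as np

# Hypothetical tab-delimited sample: an ID, three numeric columns, a text column.
sample = (
    "gene\ta\tb\tc\tdesc\n"
    "g1\t7.32\t9.5\t7.76\tfirst gene\n"
    "g2\t0.0\t0.0\t0.0\tsecond gene\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
header = next(reader)
# Keep only records whose length matches the header.
records = [record for record in reader if len(record) == len(header)]

# Convert only the numeric columns (indices 1 to 3 here) to a float array.
values = np.array([[float(x) for x in record[1:4]] for record in records])
# Keep the text columns as ordinary Python lists.
ids = [record[0] for record in records]

print(values.shape)
print(values.mean(axis=0))
```

Keeping the text columns out of the array avoids forcing numpy to pick a single string dtype for everything. Whether that split suits a given file depends on its layout.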
Solution 2
May I ask why you're not using the built-in csv reader? http://docs.python.org/library/csv.html
I've used it very effectively with numpy/scipy. I can't share my code because it's owned by my employer, but it should be very straightforward to write your own.
Author by
Admin
Updated on June 04, 2022
Comments
-
Admin almost 2 years
I am having trouble reading a tab-delimited CSV file in Python. I use the following function:
def csv2array(filename, skiprows=0, delimiter='\t', raw_header=False, missing=None, with_header=True):
    """
    Parse a file name into an array. Return the array and additional header lines. By default,
    parse the header lines into dictionaries, assuming the parameters are numeric,
    using 'parse_header'.
    """
    f = open(filename, 'r')
    skipped_rows = []
    for n in range(skiprows):
        header_line = f.readline().strip()
        if raw_header:
            skipped_rows.append(header_line)
        else:
            skipped_rows.append(parse_header(header_line))
    f.close()
    if missing:
        data = genfromtxt(filename, dtype=None, names=with_header,
                          deletechars='', skiprows=skiprows, missing=missing)
    else:
        if delimiter != '\t':
            data = genfromtxt(filename, dtype=None, names=with_header, delimiter=delimiter,
                              deletechars='', skiprows=skiprows)
        else:
            data = genfromtxt(filename, dtype=None, names=with_header,
                              deletechars='', skiprows=skiprows)
    if data.ndim == 0:
        data = array([data.item()])
    return (data, skipped_rows)
The problem is that genfromtxt complains about my files, e.g. with the error:
Line #27100 (got 12 columns instead of 16)
I am not sure where these errors come from. Any ideas?
Here's an example file that causes the problem:
#Gene	120-1	120-3	120-4	30-1	30-3	30-4	C-1	C-2	C-5	genesymbol	genedesc
ENSMUSG00000000001	7.32	9.5	7.76	7.24	11.35	8.83	6.67	11.35	7.12	Gnai3	guanine nucleotide binding protein alpha
ENSMUSG00000000003	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	Pbsn	probasin
Is there a better way to write a generic csv2array function? Thanks.
-
Admin almost 14 years
I added an example file that leads to the error. It looks to me like it has the right number of columns, but for some reason genfromtxt thinks it has 16 columns. Any idea what causes this?
-
Admin almost 14 years
Unfortunately, this does not make a numpy array out of the result.
-
Nick T almost 14 years
You can do whatever you like with the data in the loop body; there it's a list broken up by delimiter. You could check that it's as long as you expect (as in the edited example), or validate each field to make sure you're not passing garbage into your numpy array.
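One plausible explanation for the "got 12 columns instead of 16" error, offered as an assumption rather than a confirmed diagnosis: csv2array only passes `delimiter` to genfromtxt when it is not '\t', and genfromtxt by default treats any run of whitespace as a separator. A gene description containing spaces would then be split into several columns. The sketch below counts fields in the two sample rows from the question both ways, using plain string splitting. It assumes the fields in the original file are separated by tabs, and `count_fields` is a helper name made up for illustration.

```python
# Sample rows from the question, with fields assumed to be tab-separated.
rows = [
    "\t".join(["ENSMUSG00000000001", "7.32", "9.5", "7.76", "7.24", "11.35",
               "8.83", "6.67", "11.35", "7.12", "Gnai3",
               "guanine nucleotide binding protein alpha"]),
    "\t".join(["ENSMUSG00000000003"] + ["0.0"] * 9 + ["Pbsn", "probasin"]),
]


def count_fields(line, delimiter=None):
    """Count fields; delimiter=None splits on any run of whitespace."""
    return len(line.split(delimiter))


whitespace_counts = [count_fields(row) for row in rows]
tab_counts = [count_fields(row, "\t") for row in rows]

print(whitespace_counts)  # [16, 12]
print(tab_counts)         # [12, 12]
```

Splitting on whitespace gives 16 fields for the first row and 12 for the second, which matches the numbers in the error message. Splitting on tabs gives 12 for both. If this is indeed the cause, passing `delimiter='\t'` to genfromtxt explicitly in every branch should keep each description as a single field.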