Creating a matrix from CSV file
Solution 1
This is my solution using only the csv library, and working with the index\position in the csv (using an offset which I use to mantain memory on the current row)
import csv
with open('test.csv', 'r') as csvfile:
spamreader = csv.reader(csvfile, delimiter=',')
list_of_list = []
j=0
lines = [line for line in spamreader]
for i in range(len(lines)):
list_ = []
if(len(lines)<=i+j):
break;
first = lines[i+j][0]
while(first == lines[i+j][0]):
list_.append(lines[i+j][2])
j+=1
if(len(lines)<=i+j):
break;
j-=1
list_of_list.append(list(map(float,list_)))
maxlen = len(max(list_of_list))
print("\t"+"\t".join([str(el) for el in range(1,maxlen+1)])+"\n")
for i in range(len(list_of_list)):
print(str(i+1)+"\t"+"\t".join([str(el) for el in list_of_list[i]])+"\n")
Anyway the solution posted by Saullo is more elegant
This is my output:
1 2 3 4 5
1 51.0 39.0 40.0 60.0 80.0
2 40.0 28.0 40.0 39.0
3 10.0 20.0 30.0 40.0
I wrote a new version of the code with an iterator, because the csv is too big to fit in memory
import csv
with open('test.csv', 'r') as csvfile:
spamreader = csv.reader(csvfile, delimiter=',')
list_of_list = []
line1 = next(spamreader)
first = line1[0]
list_ = [line1[2]]
for line in spamreader:
while(line[0] == first):
list_.append(line[2])
try:
line = next(spamreader)
except :
break;
list_of_list.append(list(map(float,list_)))
list_ = [line[2]]
first = line[0]
maxlen = len(max(list_of_list))
print("\t"+"\t".join([str(el) for el in range(1,maxlen+1)])+"\n")
for i in range(len(list_of_list)):
print(str(i+1)+"\t"+"\t".join([str(el) for el in list_of_list[i]])+"\n")
Anyway probably you need to work on the Matrix in chunks (and doing swaps) because probably the data wan't fit in a 2d array
Solution 2
You can use scipy.sparse.coo_matrix
to load this data very conveniently.
Working with your input:
Input:
1 1 51 9 3
1 2 39 4 4
1 3 40 3 9
1 4 60 2 .
1 5 80 2 .
2 1 40 6 .
2 2 28 4 .
2 3 40 2 .
2 4 39 3 .
3 1 10 . .
3 2 20 . .
3 3 30 . .
3 4 40 . .
. . . . .
You could do:
l, c, v = np.loadtxt('test.txt', skiprows=1).T
m = coo_matrix((v, (l-1, c-1)), shape=(l.max(), c.max()))
Then you can convert the coo_matrix
to a np.ndarray
:
In [9]: m.toarray()
Out[9]:
array([[ 51., 39., 40., 60., 80.],
[ 40., 28., 40., 39., 0.],
[ 10., 20., 30., 40., 0.]])
Solution 3
You should seriously consider using pandas. It is really ideal for this sort of work. I can't give you an actual solution because I don't have your data, but I would try something like the following:
import pandas as pd
df = pd.read_csv('test.csv', usecols=[0,1,2], names=['A', 'B', 'C'])
pd.pivot_table(df, index='A', columns='B', values='C')
The second line imports the data into a pandas DataFrame object (change the names into something more useful for your application). The pivot table creates the matrix you are looking for, and gracefully handles any missing data.
dizzyLife
Updated on June 09, 2022Comments
-
dizzyLife almost 2 years
I've been working on Python for around 2 months now so I have a OK understanding of it.
My goal is to create a matrix using CSV data, then populating that matrix from the data in the 3rd column of that CSV file.
I came up with this code thus far:
import csv import csv def readcsv(csvfile_name): with open(csvfile_name) as csvfile: file=csv.reader(csvfile, delimiter=",") #remove rubbish data in first few rows skiprows = int(input('Number of rows to skip? ')) for i in range(skiprows): _ = next(file) #change strings into integers/floats for z in file: z[:2]=map(int, z[:2]) z[2:]=map(float, z[2:]) print(z[:2]) return
After removing the rubbish data with the above code, the data in the CSV file looks like this:
Input: 1 1 51 9 3 1 2 39 4 4 1 3 40 3 9 1 4 60 2 . 1 5 80 2 . 2 1 40 6 . 2 2 28 4 . 2 3 40 2 . 2 4 39 3 . 3 1 10 . . 3 2 20 . . 3 3 30 . . 3 4 40 . . . . . . .
The output should look like this:
1 2 3 4 . . 1 51 39 40 60 2 40 28 40 39 3 10 20 30 40 . .
There are about a few thousand rows and columns in this CSV file, however I'm only interested is the first 3 columns of the CSV file. So the first and second columns are basically like co-ordinates for the matrix, and then populating the matrix with data in the 3rd column.
After lots of trial and error, I realised that numpy was the way to go with matrices. This is what I tried thus far with example data:
left_column = [1, 2, 1, 2, 1, 2, 1, 2] middle_column = [1, 1, 3, 3, 2, 2, 4, 4] right_column = [1., 5., 3., 7., 2., 6., 4., 8.] import numpy as np m = np.zeros((max(left_column), max(middle_column)), dtype=np.float) for x, y, z in zip(left_column, middle_column, right_column): x -= 1 # Because the indicies are 1-based y -= 1 # Need to be 0-based m[x, y] = z print(m) #: array([[ 1., 2., 3., 4.], #: [ 5., 6., 7., 8.]])
However, it is unrealistic for me to specify all of my data in my script to generate the matrix. I tried using the generators to pull the data out of my CSV file but it didn't work well for me.
I learnt as much numpy as I could, however it appears like it requires my data to already be in matrix form, which it isn't.