Creating a matrix from CSV file

10,950

Solution 1

This is my solution using only the csv library, and working with the index\position in the csv (using an offset which I use to mantain memory on the current row)

import csv

with open('test.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    list_of_list = []
    j=0
    lines = [line for line in spamreader]
    for i in range(len(lines)):
        list_ = []
        if(len(lines)<=i+j):
            break;
        first = lines[i+j][0]
        while(first == lines[i+j][0]):
            list_.append(lines[i+j][2])
            j+=1
            if(len(lines)<=i+j):
                break;
        j-=1
        list_of_list.append(list(map(float,list_)))

maxlen = len(max(list_of_list))
print("\t"+"\t".join([str(el) for el in range(1,maxlen+1)])+"\n")
for i in range(len(list_of_list)):
    print(str(i+1)+"\t"+"\t".join([str(el) for el in list_of_list[i]])+"\n")

Anyway the solution posted by Saullo is more elegant

This is my output:

        1       2       3       4       5

1       51.0    39.0    40.0    60.0    80.0

2       40.0    28.0    40.0    39.0

3       10.0    20.0    30.0    40.0

I wrote a new version of the code with an iterator, because the csv is too big to fit in memory

import csv

with open('test.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    list_of_list = []

    line1 = next(spamreader)
    first = line1[0]
    list_ = [line1[2]]
    for line in spamreader:
        while(line[0] == first):
            list_.append(line[2])
            try:
                line = next(spamreader)
            except :
                break;
        list_of_list.append(list(map(float,list_)))
        list_ = [line[2]]
        first = line[0]

maxlen = len(max(list_of_list))
print("\t"+"\t".join([str(el) for el in range(1,maxlen+1)])+"\n")
for i in range(len(list_of_list)):
    print(str(i+1)+"\t"+"\t".join([str(el) for el in list_of_list[i]])+"\n")

Anyway probably you need to work on the Matrix in chunks (and doing swaps) because probably the data wan't fit in a 2d array

Solution 2

You can use scipy.sparse.coo_matrix to load this data very conveniently.

Working with your input:

 Input:
   1  1  51 9 3 
   1  2  39 4 4
   1  3  40 3 9
   1  4  60 2 . 
   1  5  80 2 .
   2  1  40 6 .
   2  2  28 4 .
   2  3  40 2 .
   2  4  39 3 . 
   3  1  10 . .
   3  2  20 . .
   3  3  30 . .
   3  4  40 . .
   .  .   . . .

You could do:

l, c, v = np.loadtxt('test.txt', skiprows=1).T
m = coo_matrix((v, (l-1, c-1)), shape=(l.max(), c.max()))

Then you can convert the coo_matrix to a np.ndarray:

In [9]: m.toarray()
Out[9]:
array([[ 51.,  39.,  40.,  60.,  80.],
       [ 40.,  28.,  40.,  39.,   0.],
       [ 10.,  20.,  30.,  40.,   0.]])

Solution 3

You should seriously consider using pandas. It is really ideal for this sort of work. I can't give you an actual solution because I don't have your data, but I would try something like the following:

import pandas as pd
df = pd.read_csv('test.csv', usecols=[0,1,2], names=['A', 'B', 'C'])
pd.pivot_table(df, index='A', columns='B', values='C')

The second line imports the data into a pandas DataFrame object (change the names into something more useful for your application). The pivot table creates the matrix you are looking for, and gracefully handles any missing data.

Share:
10,950
dizzyLife
Author by

dizzyLife

Updated on June 09, 2022

Comments

  • dizzyLife
    dizzyLife almost 2 years

    I've been working on Python for around 2 months now so I have a OK understanding of it.

    My goal is to create a matrix using CSV data, then populating that matrix from the data in the 3rd column of that CSV file.

    I came up with this code thus far:

    import csv
    
    import csv
    def readcsv(csvfile_name):
          with open(csvfile_name) as csvfile:
          file=csv.reader(csvfile, delimiter=",")
    
    #remove rubbish data in first few rows
    
          skiprows = int(input('Number of rows to skip? '))
               for i in range(skiprows):
                    _ = next(file)
    
    #change strings into integers/floats
    
                for z in file:
                     z[:2]=map(int, z[:2])
                     z[2:]=map(float, z[2:])
                     print(z[:2])
            return
    

    After removing the rubbish data with the above code, the data in the CSV file looks like this:

       Input:
       1  1  51 9 3 
       1  2  39 4 4
       1  3  40 3 9
       1  4  60 2 . 
       1  5  80 2 .
       2  1  40 6 .
       2  2  28 4 .
       2  3  40 2 .
       2  4  39 3 . 
       3  1  10 . .
       3  2  20 . .
       3  3  30 . .
       3  4  40 . .
       .  .   . . .
    

    The output should look like this:

          1   2   3   4  .  .
       1  51  39  40  60
       2  40  28  40  39
       3  10  20  30  40
       .
       .
    

    There are about a few thousand rows and columns in this CSV file, however I'm only interested is the first 3 columns of the CSV file. So the first and second columns are basically like co-ordinates for the matrix, and then populating the matrix with data in the 3rd column.

    After lots of trial and error, I realised that numpy was the way to go with matrices. This is what I tried thus far with example data:

      left_column =   [1, 2, 1, 2, 1, 2, 1, 2]
      middle_column = [1, 1, 3, 3, 2, 2, 4, 4]
      right_column =  [1., 5., 3., 7., 2., 6., 4., 8.]
    
      import numpy as np
      m = np.zeros((max(left_column), max(middle_column)), dtype=np.float)
      for x, y, z in zip(left_column, middle_column, right_column):
          x -= 1 # Because the indicies are 1-based
          y -= 1 # Need to be 0-based
          m[x, y] = z
      print(m)
    
      #: array([[ 1., 2., 3., 4.],
      #:        [ 5., 6., 7., 8.]])
    

    However, it is unrealistic for me to specify all of my data in my script to generate the matrix. I tried using the generators to pull the data out of my CSV file but it didn't work well for me.

    I learnt as much numpy as I could, however it appears like it requires my data to already be in matrix form, which it isn't.