Creating a matrix from CSV file

python csv parsing numpy matrix

10,950

Solution 1

This is my solution using only the csv library, and working with the index\position in the csv (using an offset which I use to mantain memory on the current row)

import csv

with open('test.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    list_of_list = []
    j=0
    lines = [line for line in spamreader]
    for i in range(len(lines)):
        list_ = []
        if(len(lines)<=i+j):
            break;
        first = lines[i+j][0]
        while(first == lines[i+j][0]):
            list_.append(lines[i+j][2])
            j+=1
            if(len(lines)<=i+j):
                break;
        j-=1
        list_of_list.append(list(map(float,list_)))

maxlen = len(max(list_of_list))
print("\t"+"\t".join([str(el) for el in range(1,maxlen+1)])+"\n")
for i in range(len(list_of_list)):
    print(str(i+1)+"\t"+"\t".join([str(el) for el in list_of_list[i]])+"\n")

Anyway the solution posted by Saullo is more elegant

This is my output:

        1       2       3       4       5

1       51.0    39.0    40.0    60.0    80.0

2       40.0    28.0    40.0    39.0

3       10.0    20.0    30.0    40.0

I wrote a new version of the code with an iterator, because the csv is too big to fit in memory

import csv

with open('test.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    list_of_list = []

    line1 = next(spamreader)
    first = line1[0]
    list_ = [line1[2]]
    for line in spamreader:
        while(line[0] == first):
            list_.append(line[2])
            try:
                line = next(spamreader)
            except :
                break;
        list_of_list.append(list(map(float,list_)))
        list_ = [line[2]]
        first = line[0]

maxlen = len(max(list_of_list))
print("\t"+"\t".join([str(el) for el in range(1,maxlen+1)])+"\n")
for i in range(len(list_of_list)):
    print(str(i+1)+"\t"+"\t".join([str(el) for el in list_of_list[i]])+"\n")

Anyway probably you need to work on the Matrix in chunks (and doing swaps) because probably the data wan't fit in a 2d array

Solution 2

You can use scipy.sparse.coo_matrix to load this data very conveniently.

Working with your input:

 Input:
   1  1  51 9 3 
   1  2  39 4 4
   1  3  40 3 9
   1  4  60 2 . 
   1  5  80 2 .
   2  1  40 6 .
   2  2  28 4 .
   2  3  40 2 .
   2  4  39 3 . 
   3  1  10 . .
   3  2  20 . .
   3  3  30 . .
   3  4  40 . .
   .  .   . . .

You could do:

l, c, v = np.loadtxt('test.txt', skiprows=1).T
m = coo_matrix((v, (l-1, c-1)), shape=(l.max(), c.max()))

Then you can convert the coo_matrix to a np.ndarray:

In [9]: m.toarray()
Out[9]:
array([[ 51.,  39.,  40.,  60.,  80.],
       [ 40.,  28.,  40.,  39.,   0.],
       [ 10.,  20.,  30.,  40.,   0.]])

Solution 3

You should seriously consider using pandas. It is really ideal for this sort of work. I can't give you an actual solution because I don't have your data, but I would try something like the following:

import pandas as pd
df = pd.read_csv('test.csv', usecols=[0,1,2], names=['A', 'B', 'C'])
pd.pivot_table(df, index='A', columns='B', values='C')

The second line imports the data into a pandas DataFrame object (change the names into something more useful for your application). The pivot table creates the matrix you are looking for, and gracefully handles any missing data.

10,950

Author by

dizzyLife

Updated on June 09, 2022

Comments

dizzyLife almost 2 years

I've been working on Python for around 2 months now so I have a OK understanding of it.

My goal is to create a matrix using CSV data, then populating that matrix from the data in the 3rd column of that CSV file.

I came up with this code thus far:

import csv

import csv
def readcsv(csvfile_name):
      with open(csvfile_name) as csvfile:
      file=csv.reader(csvfile, delimiter=",")

#remove rubbish data in first few rows

      skiprows = int(input('Number of rows to skip? '))
           for i in range(skiprows):
                _ = next(file)

#change strings into integers/floats

            for z in file:
                 z[:2]=map(int, z[:2])
                 z[2:]=map(float, z[2:])
                 print(z[:2])
        return

After removing the rubbish data with the above code, the data in the CSV file looks like this:

   Input:
   1  1  51 9 3 
   1  2  39 4 4
   1  3  40 3 9
   1  4  60 2 . 
   1  5  80 2 .
   2  1  40 6 .
   2  2  28 4 .
   2  3  40 2 .
   2  4  39 3 . 
   3  1  10 . .
   3  2  20 . .
   3  3  30 . .
   3  4  40 . .
   .  .   . . .

The output should look like this:

      1   2   3   4  .  .
   1  51  39  40  60
   2  40  28  40  39
   3  10  20  30  40
   .
   .

There are about a few thousand rows and columns in this CSV file, however I'm only interested is the first 3 columns of the CSV file. So the first and second columns are basically like co-ordinates for the matrix, and then populating the matrix with data in the 3rd column.

After lots of trial and error, I realised that numpy was the way to go with matrices. This is what I tried thus far with example data:

  left_column =   [1, 2, 1, 2, 1, 2, 1, 2]
  middle_column = [1, 1, 3, 3, 2, 2, 4, 4]
  right_column =  [1., 5., 3., 7., 2., 6., 4., 8.]

  import numpy as np
  m = np.zeros((max(left_column), max(middle_column)), dtype=np.float)
  for x, y, z in zip(left_column, middle_column, right_column):
      x -= 1 # Because the indicies are 1-based
      y -= 1 # Need to be 0-based
      m[x, y] = z
  print(m)

  #: array([[ 1., 2., 3., 4.],
  #:        [ 5., 6., 7., 8.]])

However, it is unrealistic for me to specify all of my data in my script to generate the matrix. I tried using the generators to pull the data out of my CSV file but it didn't work well for me.

I learnt as much numpy as I could, however it appears like it requires my data to already be in matrix form, which it isn't.