How to clean large malformed CSV file using Python

10,418

This is a comment about how i would tackle this.

For each line:

I can easily identify start and of end of certain groups:

  • Hostname - there is only one
  • usernames - read these until you meet something that does not look like a username (comma delimited)
  • ip address - read these until you meet a timestamp - identified with a pattern match - be aware these are separated by space rather than comma. The end of group is identified by the trailing comma.
  • timestamp - easy to identify with a pattern match
  • test1, test2, test3 - certain to be there as comma delimted fields

Notes: I would use the 'pattern' matches to enable me to identify i have the correct thing in the correct place. It enables spotting errors sooner.

Share:
10,418

Related videos on Youtube

spork_user
Author by

spork_user

Updated on June 06, 2022

Comments

  • spork_user
    spork_user almost 2 years

    I'm attempting to use Python 2.7.5 to clean up a malformed CSV file. The CSV file is fairly large (over 1GB). The first row of the file correctly lists the column headings, but after that each field is on a new line (unless it is blank) and some fields are multi-line. The multi-line fields are not surrounded by quotes, but need to be surrounded by quotes in the output. The number of columns is static and known. The pattern in the sample input provided is repeated throughout the length of the file.

    Input file (sample):

    Hostname,Username,IP Addresses,Timestamp,Test1,Test2,Test3
    my_hostname
    ,my_username
    ,10.0.0.1
    192.168.1.1
    ,2015-02-11 13:41:54 -0600
    ,,true
    ,false
    my_2nd_hostname
    ,my_2nd_username
    ,10.0.0.2
    192.168.1.2
    ,2015-02-11 14:04:41 -0600
    ,true
    ,,false
    

    Desired output:

    Hostname,Username,IP Addresses,Timestamp,Test1,Test2,Test3
    my_hostname,my_username,"10.0.0.1 192.168.1.1",2015-02-11 13:41:54 -0600,,true,false
    my_2nd_hostname,my_2nd_username,"10.0.0.2 192.168.1.2",2015-02-11 14:04:41 -0600,true,,false
    

    I've gone down a couple paths that address one of the issues only to realize that it doesn't handle another aspect of the malformed data. I would appreciate if anyone could please help me identify an efficient way to clean up this file.

    Thanks

    EDIT

    I have several code scraps from going down different paths, but here is the current iteration. It isn't pretty, just a bunch of hacks to try and figure this out.

    import csv
    
    inputfile = open('input.csv', 'r')
    outputfile_1 = open('output.csv', 'w')
    
    counter = 1
    for line in inputfile:
        #Skip header row
        if counter == 1:
            outputfile_1.write(line)
            counter = counter + 1
        else:
            line = line.replace('\r', '').replace('\n', '')
            outputfile_1.write(line)
    
    inputfile.close()
    outputfile_1.close()
    
    with open('output.csv', 'r') as f:
        text = f.read()
    
        comma_count = text.count(',') #comma_count/6 = total number of rows
    
        #need to insert a newline after the field contents after every 6th comma
        #unfortunately the last field of the row and the first field of the next row are now rammed up together becaue of the newline replaces above...
        #then process as normal CSV
    
        #one path I started to go down... but this isn't even functional
        groups = text.split(',')
    
        counter2 = 1
        while (counter2 <= comma_count/6):
            line = ','.join(groups[:(6*counter2)]), ','.join(groups[(6*counter2):])
            print line
            counter2 = counter2 + 1
    

    EDIT 2

    Thanks to @DSM and @Ryan Vincent for getting me on the right track. Using their ideas I made the following code, which seems to correct my malformed CSV. I'm sure there are many places for improvement though, which I would happily accept.

    import csv
    import re
    
    outputfile_1 = open('output.csv', 'wb')
    wr = csv.writer(outputfile_1, quoting=csv.QUOTE_ALL)
    
    with open('input.csv', 'r') as f:
        text = f.read()
        comma_indices = [m.start() for m in re.finditer(',', text)] #Find all the commas - the fields are between them
    
        cursor = 0
        field_counter = 1
        row_count = 0
        csv_row = []
    
        for index in comma_indices:
            newrowflag = False
    
            if "\r" in text[cursor:index]:
                #This chunk has two fields, the last of one row and first of the next
                next_field=text[cursor:index].split('\r')
                next_field_trimmed = next_field[0].replace('\n',' ').rstrip().lstrip()
                csv_row.extend([next_field_trimmed]) #Add the last field of this row
    
                #Reset the cursor to be in the middle of the chuck (after the last field and before the next)
                #And set a flag that we need to start the next csvrow before we move on to the next comma index
                cursor = cursor+text[cursor:index].index('\r')+1
                newrowflag = True
            else:
                next_field_trimmed = text[cursor:index].replace('\n',' ').rstrip().lstrip()
                csv_row.extend([next_field_trimmed])
    
                #Advance the cursor to the character after the comma to start the next field
                cursor = index + 1
    
            #If we've done 7 fields then we've finished the row
            if field_counter%7==0:
                row_count = row_count + 1
                wr.writerow(csv_row)
    
                #Reset
                csv_row = []
    
                #If the last chunk had 2 fields in it...
                if newrowflag:
                    next_field_trimmed = next_field[1].replace('\n',' ').rstrip().lstrip()
                    csv_row.extend([next_field_trimmed])
                    field_counter = field_counter + 1
    
            field_counter = field_counter + 1
        #Write the last row
        wr.writerow(csv_row)
    
    outputfile_1.close()
    
    # Process output.csv as normal CSV file...    
    
    • DSM
      DSM about 9 years
      This doesn't seem too bad; in each group, you're reading seven lines and then reformatting them a bit before printing them out. Is there a more specific issue with your code which is causing problems?