How to clean large malformed CSV file using Python
This is a comment about how I would tackle this.
For each line:
I can easily identify the start and end of certain groups:
- Hostname - there is only one
- Usernames - read these until you meet something that does not look like a username (comma delimited)
- IP addresses - read these until you meet a timestamp, identified with a pattern match; be aware these are separated by spaces rather than commas. The end of the group is marked by the trailing comma.
- Timestamp - easy to identify with a pattern match
- Test1, Test2, Test3 - certain to be there as comma-delimited fields
Notes: I would use the pattern matches to confirm I have the correct thing in the correct place. It enables spotting errors sooner (see the sketch below).
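A minimal sketch of that classification idea in Python. The regexes here are illustrative guesses, not taken from the real data, and would need tuning:

import re

# Illustrative patterns - assumptions, not derived from the actual file.
TIMESTAMP = re.compile(r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4}$')
IP_LIST = re.compile(r'^\d{1,3}(\.\d{1,3}){3}( \d{1,3}(\.\d{1,3}){3})*$')

def classify(field):
    """Label a raw field so a value in the wrong place is caught early."""
    field = field.strip()
    if TIMESTAMP.match(field):
        return 'timestamp'
    if IP_LIST.match(field):
        return 'ip_addresses'
    return 'other'  # hostname/username/test flags need positional context

print(classify('10.0.0.1 192.168.1.1'))       # ip_addresses
print(classify('2015-02-11 13:41:54 -0600'))  # timestamp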
spork_user
Updated on June 06, 2022

Comments
-
spork_user almost 2 years
I'm attempting to use Python 2.7.5 to clean up a malformed CSV file. The CSV file is fairly large (over 1GB). The first row of the file correctly lists the column headings, but after that each field is on a new line (unless it is blank) and some fields are multi-line. The multi-line fields are not surrounded by quotes, but need to be surrounded by quotes in the output. The number of columns is static and known. The pattern in the sample input provided is repeated throughout the length of the file.
Input file (sample):
Hostname,Username,IP Addresses,Timestamp,Test1,Test2,Test3
my_hostname
,my_username
,10.0.0.1
192.168.1.1
,2015-02-11 13:41:54 -0600
,,true
,false
my_2nd_hostname
,my_2nd_username
,10.0.0.2
192.168.1.2
,2015-02-11 14:04:41 -0600
,true
,,false
Desired output:
Hostname,Username,IP Addresses,Timestamp,Test1,Test2,Test3
my_hostname,my_username,"10.0.0.1 192.168.1.1",2015-02-11 13:41:54 -0600,,true,false
my_2nd_hostname,my_2nd_username,"10.0.0.2 192.168.1.2",2015-02-11 14:04:41 -0600,true,,false
I've gone down a couple of paths that each address one of the issues, only to realize they don't handle another aspect of the malformed data. I would appreciate it if anyone could please help me identify an efficient way to clean up this file.
Thanks
EDIT
I have several code scraps from going down different paths, but here is the current iteration. It isn't pretty, just a bunch of hacks to try and figure this out.
import csv

inputfile = open('input.csv', 'r')
outputfile_1 = open('output.csv', 'w')

counter = 1
for line in inputfile:
    # Skip header row
    if counter == 1:
        outputfile_1.write(line)
        counter = counter + 1
    else:
        line = line.replace('\r', '').replace('\n', '')
        outputfile_1.write(line)

inputfile.close()
outputfile_1.close()

with open('output.csv', 'r') as f:
    text = f.read()

comma_count = text.count(',')
# comma_count/6 = total number of rows
# Need to insert a newline after the field contents after every 6th comma.
# Unfortunately the last field of the row and the first field of the next
# row are now rammed up together because of the newline replaces above...
# Then process as normal CSV.

# One path I started to go down... but this isn't even functional
groups = text.split(',')
counter2 = 1
while counter2 <= comma_count / 6:
    line = ','.join(groups[:(6 * counter2)]), ','.join(groups[(6 * counter2):])
    print line
    counter2 = counter2 + 1
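As an aside, the regrouping step described in those comments - turning a flat list of fields into rows of seven - is straightforward on its own. A small sketch with made-up values, assuming the flat list is already split correctly (which, as the comments above note, the newline stripping does not actually guarantee):

def rows_of(fields, width=7):
    # Yield consecutive fixed-width slices of a flat field list.
    for start in range(0, len(fields), width):
        yield fields[start:start + width]

# Hypothetical, already-correct flat field list for illustration only
flat = ['h1', 'u1', '10.0.0.1', 'ts1', '', 'true', 'false',
        'h2', 'u2', '10.0.0.2', 'ts2', 'true', '', 'false']
for row in rows_of(flat):
    print(','.join(row))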
EDIT 2
Thanks to @DSM and @Ryan Vincent for getting me on the right track. Using their ideas I made the following code, which seems to correct my malformed CSV. I'm sure there is plenty of room for improvement, though, and I would happily accept suggestions.
import csv
import re

outputfile_1 = open('output.csv', 'wb')
wr = csv.writer(outputfile_1, quoting=csv.QUOTE_ALL)

with open('input.csv', 'r') as f:
    text = f.read()

# Find all the commas - the fields are between them
comma_indices = [m.start() for m in re.finditer(',', text)]

cursor = 0
field_counter = 1
row_count = 0
csv_row = []

for index in comma_indices:
    newrowflag = False
    if "\r" in text[cursor:index]:
        # This chunk has two fields: the last of one row and the first of the next
        next_field = text[cursor:index].split('\r')
        next_field_trimmed = next_field[0].replace('\n', ' ').rstrip().lstrip()
        csv_row.extend([next_field_trimmed])  # Add the last field of this row
        # Point the cursor at the middle of the chunk (after the last field and
        # before the next) and flag that the next csv_row must be started
        # before we move on to the next comma index
        cursor = cursor + text[cursor:index].index('\r') + 1
        newrowflag = True
    else:
        next_field_trimmed = text[cursor:index].replace('\n', ' ').rstrip().lstrip()
        csv_row.extend([next_field_trimmed])
    # Advance the cursor to the character after the comma to start the next field
    cursor = index + 1
    # If we've done 7 fields then we've finished the row
    if field_counter % 7 == 0:
        row_count = row_count + 1
        wr.writerow(csv_row)
        # Reset
        csv_row = []
        # If the last chunk had 2 fields in it...
        if newrowflag:
            next_field_trimmed = next_field[1].replace('\n', ' ').rstrip().lstrip()
            csv_row.extend([next_field_trimmed])
            field_counter = field_counter + 1
    field_counter = field_counter + 1

# Write the last row
wr.writerow(csv_row)
outputfile_1.close()

# Process output.csv as normal CSV file...
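For anyone looking to simplify that approach, here is a sketch of a single line-by-line pass that avoids indexing every comma. It relies on two properties of the sample: a record always opens on a line with no leading comma, and a comma-less line arriving while the current record is still short of seven fields must be continuing a multi-line field. The names are mine, and the quoting rule (quote only fields stitched together from several lines) is inferred from the desired output above - treat this as a sketch, not a drop-in replacement:

def write_record(out, fields, multiline):
    # Quote only the fields that were stitched together from several input
    # lines - in the sample output the IP list is quoted but the timestamp
    # (which also contains spaces) is not. Assumes no embedded commas or
    # quotes in the data itself.
    parts = ['"%s"' % f if i in multiline else f
             for i, f in enumerate(fields)]
    out.write(','.join(parts) + '\n')

NUM_FIELDS = 7  # column count is static and known

with open('input.csv', 'r') as src, open('output.csv', 'w') as dst:
    dst.write(next(src))  # header row is already well formed
    fields, multiline = [], set()
    for line in src:
        line = line.strip()
        if line.startswith(','):
            # A leading comma introduces one or more fields of the current
            # record (blank fields share a line, e.g. ",,true").
            fields.extend(p.strip() for p in line.split(',')[1:])
        elif len(fields) in (0, NUM_FIELDS):
            # The current record is complete (or we are at the start of the
            # data), so this line opens the next record.
            if fields:
                write_record(dst, fields, multiline)
            fields, multiline = [line], set()
        else:
            # No leading comma and the record is unfinished: this line
            # continues the previous (multi-line) field.
            fields[-1] = fields[-1] + ' ' + line
            multiline.add(len(fields) - 1)
    if fields:
        write_record(dst, fields, multiline)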
-
DSM about 9 years
This doesn't seem too bad; in each group, you're reading seven lines and then reformatting them a bit before printing them out. Is there a more specific issue with your code which is causing problems?
-