Text File Parsing with Python

Solution 1

I would use a for loop to iterate over the lines in the text file:

for line in my_text:
    outputfile.writelines(data_parser(line, reps))

If you want to read the file line by line instead of loading the whole thing at the start of the script, you could do something like this:

inputfile = open('test.dat')
outputfile = open('test.csv', 'w')

# sample text string, shown only to illustrate what the data looks like
# my_text = '"2012-06-23 03:09:13.23",4323584,-1.911224,-0.4657288,-0.1166382,-0.24823,0.256485,"NAN",-0.3489428,-0.130449,-0.2440527,-0.2942413,0.04944348,0.4337797,-1.105218,-1.201882,-0.5962594,-0.586636'

# replacement dictionary: the '0-', '1-', etc. entries split the dash-delimited date block while leaving negative numbers unaffected
reps = {'"NAN"':'NAN', '"':'', '0-':'0,','1-':'1,','2-':'2,','3-':'3,','4-':'4,','5-':'5,','6-':'6,','7-':'7,','8-':'8,','9-':'9,', ' ':',', ':':',' }

for i in range(4):
    next(inputfile)  # skip the four header lines; same as inputfile.next() in Python 2
for line in inputfile:
    outputfile.writelines(data_parser(line, reps))

inputfile.close()
outputfile.close()
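
For readers on Python 3, where file objects have no .next() method and dict.iteritems() no longer exists, the same line-by-line idea might be sketched as follows. This is an illustration, not the answerer's code: the helper name convert and its header_lines parameter are made up here, while data_parser and the replacement table are taken from the question. Dicts iterate in insertion order on Python 3.7+, so the replacements are applied in the order written.

```python
from itertools import islice

# Replacement table from the question.
REPS = {'"NAN"': 'NAN', '"': '',
        '0-': '0,', '1-': '1,', '2-': '2,', '3-': '3,', '4-': '4,',
        '5-': '5,', '6-': '6,', '7-': '7,', '8-': '8,', '9-': '9,',
        ' ': ',', ':': ','}


def data_parser(text, dic):
    """Apply every replacement in dic to text, one after another."""
    for old, new in dic.items():
        text = text.replace(old, new)
    return text


def convert(infile, outfile, header_lines=4):
    """Copy infile to outfile line by line, skipping the header lines.

    islice consumes the first header_lines lines lazily, so only one
    line at a time is held in memory.
    """
    for line in islice(infile, header_lines, None):
        outfile.write(data_parser(line, REPS))


# Typical use with real files (file names as in the question):
#     with open('test.dat') as src, open('test.csv', 'w') as dst:
#         convert(src, dst)
```

The with statement in the usage comment closes both files even if parsing raises, which replaces the explicit close() calls.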

Solution 2

From the accepted answer, it looks like your desired behaviour is to turn

skip 0
skip 1
skip 2
skip 3
"2012-06-23 03:09:13.23",4323584,-1.911224,-0.4657288,-0.1166382,-0.24823,0.256485,"NAN",-0.3489428,-0.130449,-0.2440527,-0.2942413,0.04944348,0.4337797,-1.105218,-1.201882,-0.5962594,-0.586636

into

2012,06,23,03,09,13.23,4323584,-1.911224,-0.4657288,-0.1166382,-0.24823,0.256485,NAN,-0.3489428,-0.130449,-0.2440527,-0.2942413,0.04944348,0.4337797,-1.105218,-1.201882,-0.5962594,-0.586636

If that's right, then I think something like

import csv

with open("test.dat", "rb") as infile, open("test.csv", "wb") as outfile:
    reader = csv.reader(infile)
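    # csv.QUOTE_MINIMAL is 0, the value the original quoting=False evaluated to;
    # the quotes are already stripped by csv.reader when the row is parsed
    writer = csv.writer(outfile, quoting=csv.QUOTE_MINIMAL)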
    for i, line in enumerate(reader):
        if i < 4: continue
        date = line[0].split()
        day = date[0].split('-')
        time = date[1].split(':')
        newline = day + time + line[1:]
        writer.writerow(newline)

would be a little simpler than the reps stuff.
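
On Python 3 the csv module expects text-mode files opened with newline="" rather than "rb"/"wb". A minimal sketch of the same approach under that assumption is below; the function name convert_rows and the header_lines parameter are hypothetical, and lineterminator="\n" is chosen here only to make the output easy to compare (the csv default is "\r\n").

```python
import csv
from itertools import islice


def convert_rows(infile, outfile, header_lines=4):
    """Split the leading timestamp of each row into separate columns."""
    reader = csv.reader(infile)  # the reader strips the quotes around fields
    writer = csv.writer(outfile, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for row in islice(reader, header_lines, None):
        if not row:  # ignore blank lines
            continue
        day, time = row[0].split()  # '2012-06-23', '03:09:13.23'
        writer.writerow(day.split("-") + time.split(":") + row[1:])


# Typical use with real files (file names as in the question):
#     with open("test.dat", newline="") as src, \
#          open("test.csv", "w", newline="") as dst:
#         convert_rows(src, dst)
```

Note that the data columns after the timestamp are passed through untouched, since they are already comma-separated.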

Solution 3

There are a few ways to go about this. One option would be to use inputfile.read() instead of inputfile.readlines() - you'd need to write separate code to strip the first four lines, but if you want the final output as a single string anyway, this might make the most sense.

A second, simpler option would be to rejoin the strings after stripping the first four lines, with my_text = ''.join(my_text). This is a little inefficient, but if speed isn't a major concern, the code will be simplest.

Finally, if you actually want the output as a list of strings instead of a single string, you can just modify your data parser to iterate over the list. That might look something like this:

def data_parser(lines, dic):
    for i, j in dic.iteritems():
        for (k, line) in enumerate(lines):
            lines[k] = line.replace(i, j)
    return lines
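
As a rough Python 3 illustration of the second and third options, the sketch below keeps the question's single-string data_parser and either wraps it in a list comprehension or applies it to the joined text. The name parse_lines, the shortened replacement table, and the two sample lines are invented for this example. Unlike the version above, parse_lines returns a new list instead of modifying its argument.

```python
def data_parser(text, dic):
    """Apply every replacement in dic to a single string."""
    for old, new in dic.items():
        text = text.replace(old, new)
    return text


def parse_lines(lines, dic):
    """Return a new list with the replacements applied to each line."""
    return [data_parser(line, dic) for line in lines]


# Shortened, made-up replacement table and data, for demonstration only.
reps = {'"': '', ' ': ',', ':': ','}
lines = ['"03:09:13.23",1\n', '"03:09:14.23",2\n']

parsed = parse_lines(lines, reps)           # list of strings
joined = data_parser(''.join(lines), reps)  # one string, as in the join option
```

Both routes give the same text here; the list form is convenient with writelines, the joined form with write.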

Author: marillion

Updated on July 09, 2022

Comments

  • marillion
    marillion almost 2 years

    I am trying to parse a series of text files and save them as CSV files using Python (2.7.3). All text files have a 4-line header which needs to be stripped out. The data lines have various delimiters, including " (quote), - (dash), : (colon), and blank space. I found it a pain to code this in C++ with all these different delimiters, so I decided to try it in Python, having heard it is relatively easier than in C/C++.

    I wrote a piece of code to test it on a single line of data and it works; however, I could not make it work for the actual file. For parsing a single line I was using a string and its "replace" method. It looks like my current implementation reads the text file as a list, and the list object has no replace method.

    Being a novice in Python, I got stuck at this point. Any input would be appreciated!

    Thanks!

    # function for parsing the data
    def data_parser(text, dic):
        for i, j in dic.iteritems():
            text = text.replace(i,j)
        return text
    
    # open input/output files
    
    inputfile = open('test.dat')
    outputfile = open('test.csv', 'w')
    
    my_text = inputfile.readlines()[4:] # reads the whole text file, skipping the first 4 lines
    
    
    # sample text string, shown only to illustrate what the data looks like
    # my_text = '"2012-06-23 03:09:13.23",4323584,-1.911224,-0.4657288,-0.1166382,-0.24823,0.256485,"NAN",-0.3489428,-0.130449,-0.2440527,-0.2942413,0.04944348,0.4337797,-1.105218,-1.201882,-0.5962594,-0.586636'
    
    # replacement dictionary: the '0-', '1-', etc. entries split the dash-delimited date block while leaving negative numbers unaffected
    reps = {'"NAN"':'NAN', '"':'', '0-':'0,','1-':'1,','2-':'2,','3-':'3,','4-':'4,','5-':'5,','6-':'6,','7-':'7,','8-':'8,','9-':'9,', ' ':',', ':':',' }
    
    txt = data_parser(my_text, reps)
    outputfile.writelines(txt)
    
    inputfile.close()
    outputfile.close()
    
    • Diego Allen
      Diego Allen over 11 years
      You should attach a copy of the file you need to parse and the expected output, that way it will be easier to help you.
  • marillion
    marillion over 11 years
    Thanks! What would be the best way to skip the first 4 lines then? I admit I could not find a way to do it, which is why I decided to read the whole thing. Should I write the file, minus the first 4 lines, to another file and then run the loop you have above? I bet there is an easier way though. EDIT: oh wait, I think you mean replacing the line txt = data_parser(my_text, reps) with the loop you have above.
  • Joe Day
    Joe Day over 11 years
    You've already skipped the first 4 lines with the line my_text = inputfile.readlines()[4:]. If you would rather read the file line by line and not load the whole thing into RAM at the beginning of the script, I can update my answer.
  • marillion
    marillion over 11 years
    Sorry, I got it wrong in the first place (see my EDIT above), but thanks, it works perfectly!!! Now, I would be very glad to learn about the "read line, parse, write line" (line-by-line) way of doing things. I have some large files of 500+ MB, which may mess things up. Could you update your answer with a second example?
  • Joe Day
    Joe Day over 11 years
    I updated my answer with a version that reads the input file a line at a time.
  • marillion
    marillion over 11 years
    Greatly appreciated, thank you! for i in range(4): inputfile.next() was what I was looking for before deciding to read the whole thing by the way!
  • marillion
    marillion over 11 years
    I tried using the csv module before coming up with the reps bit, but found the documentation a little confusing. Your example makes it much clearer. I will try this, just for the sake of learning too. 1. Do you eliminate quotes in the text file with quoting=False? 2. Could you verify my understanding? The date line in the code splits the date portion first, which becomes a list by itself; day and time are split next, and the rest of the line is appended to day and time. I am not sure how it automatically adds commas though, in your newline = day + time + line[1:] line. Hmm...
  • DSM
    DSM over 11 years
    @marillion: (1) Yes, there are lots of different quote options. I think it's a little strange to get rid of them all, actually, but maybe you need that for some reason. (2) Yep. Commas aren't added in newline -- that's just a list. writerow is the writer method which adds commas to the output string (or tabs or any other delimiter we wanted) and would handle quoting if we wanted that.
  • marillion
    marillion over 11 years
    Ok, I think I got it. Plus, you never needed to parse the data portion of the line at all, since it was already comma separated. Good information, thanks a lot!
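
To make the point about writerow in the comments above concrete, here is a small Python 3 sketch showing that the delimiter and any quoting are added when the row is written, not when the list is built. The render helper and the sample row are invented for this illustration; csv.QUOTE_MINIMAL and csv.QUOTE_ALL are standard csv module constants.

```python
import csv
import io


def render(row, quoting):
    """Write one row to an in-memory buffer and return the resulting text."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, lineterminator="\n").writerow(row)
    return buf.getvalue()


# A plain list of strings; it contains no delimiters of its own,
# apart from the comma inside the last field.
row = ["2012", "06", "13.23", "NAN", "a,b"]

minimal = render(row, csv.QUOTE_MINIMAL)  # quotes only the field that needs it
quoted = render(row, csv.QUOTE_ALL)       # quotes every field
```

With QUOTE_MINIMAL only the field containing a comma is quoted, which is why the converted data lines come out without any quotes.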