Parsing data from text file

Solution 1

It is very far from CSV, actually.

You can use the file as an iterator; the following generator function yields complete sections:

def load_sections(filename):
    with open(filename, 'r') as infile:
        line = ''
        while True:
            # skip ahead to the next '****' header line
            while not line.startswith('****'):
                line = next(infile, None)
                if line is None:
                    return  # end of file: end the generator cleanly

            entry = {}
            line = ''  # reset so a header on the last line cannot loop forever
            for line in infile:
                line = line.strip()
                if not line:
                    break  # an empty line ends the section

                key, value = map(str.strip, line.split(':', 1))
                entry[key] = value

            yield entry

This treats the file as an iterator, so every loop over it advances the file to the next line. The outer loop only moves from section to section; the inner while and for loops do the real work. The while loop skips lines until it finds a **** header line (the header itself is discarded), and the for loop then reads lines up to the next empty line to build one section.
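To make the shared-iterator behaviour concrete, here is a minimal sketch that uses io.StringIO as a stand-in for an open file. The sample text and the variable names are made up for illustration; the point is only that next() and a for loop pull from the same position, so each line is consumed exactly once.

```python
import io

# io.StringIO stands in for an open file; the contents are illustrative only.
infile = io.StringIO("header\nfirst\nsecond\n\nafter blank\n")

skipped = next(infile)          # consumes "header\n"

collected = []
for line in infile:             # resumes at "first", not at the top of the file
    line = line.strip()
    if not line:
        break                   # the blank line is consumed, then the loop stops
    collected.append(line)

remaining = next(infile, None)  # continues right after the blank line

print(collected)        # ['first', 'second']
print(repr(remaining))  # 'after blank\n'
```

Passing a default to next() (here None) avoids an exception when the file is exhausted, which is convenient inside a generator.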

Use the function in a loop:

for section in load_sections(filename):
    print(section)

Repeating your sample data in a text file results in the following (key order may differ between Python versions):

>>> for section in load_sections('/tmp/test.txt'):
...     print(section)
... 
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}

You can add some data converters to that if you want to; a mapping of key to callable would do:

converters = {'ID': int, 'Data1': float, 'Data2': float, 'Data3': float, 'Data4': int}

Then, in the generator function, replace entry[key] = value with entry[key] = converters.get(key, lambda v: v)(value). Keys without a converter keep their string value.
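Put together, the converter idea might look like the sketch below. The name load_typed_sections is hypothetical, and as an assumption for easy testing it takes any iterator of lines (an open file works) rather than a filename. The sample text is the entry from the question.

```python
import io

converters = {'ID': int, 'Data1': float, 'Data2': float,
              'Data3': float, 'Data4': int}

def load_typed_sections(lines, converters):
    """Yield one dict per section, converting values where a converter exists."""
    lines = iter(lines)
    line = ''
    while True:
        # skip ahead to the next '****' header line
        while not line.startswith('****'):
            line = next(lines, None)
            if line is None:
                return  # no more input
        entry = {}
        line = ''
        for line in lines:
            line = line.strip()
            if not line:
                break  # an empty line ends the section
            key, value = map(str.strip, line.split(':', 1))
            # fall back to the identity function for unknown keys
            entry[key] = converters.get(key, lambda v: v)(value)
        yield entry

sample = io.StringIO(
    "******** ENTRY 01 ********\n"
    "ID:                  01\n"
    "Data1:               0.1834869385E-002\n"
    "Data2:              10.9598489301\n"
    "Data3:              -0.1091356549E+001\n"
    "Data4:                715\n"
)
sections = list(load_typed_sections(sample, converters))
print(sections[0]['ID'], sections[0]['Data4'])  # 1 715
```

Both int and float accept the padded forms used in the file, such as '01' and '0.1834869385E-002', so no extra cleanup is needed before conversion.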

Solution 2

my_file:

******** ENTRY 01 ********
ID:                  01
Data1:               0.1834869385E-002
Data2:              10.9598489301
Data3:              -0.1091356549E+001
Data4:                715

ID:                  02
Data1:               0.18348674325E-012
Data2:              10.9598489301
Data3:              0.0
Data4:                5748

ID:                  03
Data1:               20.1834869385E-002
Data2:              10.954576354
Data3:              10.13476858762435E+001
Data4:                7456

Python script:

import re

with open('my_file', 'r') as f:
    data = []
    group = {}
    # each match is a (key, value) pair; '-' comes last in the class so it is literal
    for key, value in re.findall(r'(.*):\s*([\dE+.-]+)', f.read()):
        if key in group:  # a repeated key starts the next record
            data.append(group)
            group = {}
        group[key] = value
    data.append(group)

print(data)

Printed output (reformatted for readability; key order may vary by Python version):

[
    {
        'Data4': '715',
        'Data1': '0.1834869385E-002',
        'ID': '01',
        'Data3': '-0.1091356549E+001',
        'Data2': '10.9598489301'
    },
    {
        'Data4': '5748',
        'Data1': '0.18348674325E-012',
        'ID': '02',
        'Data3': '0.0',
        'Data2': '10.9598489301'
    },
    {
        'Data4': '7456',
        'Data1': '20.1834869385E-002',
        'ID': '03',
        'Data3': '10.13476858762435E+001',
        'Data2': '10.954576354'
    }
]
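The grouping rule used here is that a key seen for the second time marks the start of the next record. A minimal sketch of that rule on an in-memory string follows. The names PAIR and parse_records are hypothetical, and the pattern is slightly stricter than the one above: it is anchored to line starts and only allows spaces or tabs around the value.

```python
import re

# Hypothetical pattern: one "key: value" pair per line; '-' is last in the
# character class so it is taken literally rather than as a range.
PAIR = re.compile(r'^(\w+):[ \t]*([\dE+.-]+)[ \t]*$', re.MULTILINE)

def parse_records(text):
    """Group key/value pairs into dicts, starting a new dict when a key repeats."""
    records, group = [], {}
    for key, value in PAIR.findall(text):
        if key in group:      # a repeated key marks the next record
            records.append(group)
            group = {}
        group[key] = value
    if group:                 # skip the final append when nothing matched
        records.append(group)
    return records

text = """******** ENTRY 01 ********
ID:                  01
Data1:               0.1834869385E-002

ID:                  02
Data1:               0.18348674325E-012
"""
records = parse_records(text)
print(len(records))  # 2
```

Because records are split on repeated keys rather than on header lines, this approach relies on every block listing its fields in the same way, as the question states.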

Author by

Roman Rdgz

Telecom Engineer

Updated on July 22, 2022

Comments

  • Roman Rdgz
    Roman Rdgz almost 2 years

    I have a text file that has content like this:

    ******** ENTRY 01 ********
    ID:                  01
    Data1:               0.1834869385E-002
    Data2:              10.9598489301
    Data3:              -0.1091356549E+001
    Data4:                715
    

    This is followed by an empty line and then more blocks like it, all with the same data fields.

    I am porting some C++ code to Python. One part of it reads the file line by line, detects the title line, and then detects each field label to extract the data. This doesn't look like smart code at all, and I think Python must have some library to parse data like this easily. After all, it almost looks like CSV!

    Any idea for this?

  • Admin
    Admin almost 11 years
    I think this does not group into records.
  • Peter Varo
    Peter Varo almost 11 years
    It is better and safer to use with open('datafile') as f: than simply open('datafile'), because it closes the file automatically, even if an exception occurs.
  • Roman Rdgz
    Roman Rdgz almost 11 years
    Very fancy! But I'm allergic to regex, so I prefer Martijn's solution due to my limitations. Thanks for your great answer anyway!
  • Peter Varo
    Peter Varo almost 11 years
    @RomanRdgz Thank you! Actually I'm in love with regular expressions :)
  • 6502
    6502 almost 11 years
    @PeterVaro: better and safer and longer and (IMO) a bit uglier. Fixed anyway as apparently Python is going there