Parsing CSV / tab-delimited txt file with Python

117,441

Solution 1

Start by turning the text into a list of lists. That will take care of the parsing part:

lol = list(csv.reader(open('text.txt', 'rb'), delimiter='\t'))

The rest can be done with indexed lookups:

d = dict()
key = lol[6][0]      # cell A7
value = lol[6][3]    # cell D7
d[key] = value       # add the entry to the dictionary
 ...

Solution 2

Although there is nothing wrong with the other solutions presented, you could simplify and greatly escalate your solutions by using python's excellent library pandas.

Pandas is a library for handling data in Python, preferred by many Data Scientists.

Pandas has a simplified CSV interface to read and parse files, that can be used to return a list of dictionaries, each containing a single line of the file. The keys will be the column names, and the values will be the ones in each cell.

In your case:

    import pandas

    def create_dictionary(filename):
        my_data = pandas.DataFrame.from_csv(filename, sep='\t', index_col=False)
        # Here you can delete the dataframe columns you don't want!
        del my_data['B']
        del my_data['D']
        # ...
        # Now you transform the DataFrame to a list of dictionaries
        list_of_dicts = [item for item in my_data.T.to_dict().values()]
        return list_of_dicts

# Usage:
x = create_dictionary("myfile.csv")

Solution 3

If the file is large, you may not want to load it entirely into memory at once. This approach avoids that. (Of course, making a dict out of it could still take up some RAM, but it's guaranteed to be smaller than the original file.)

my_dict = {}
for i, line in enumerate(file):
    if (i - 8) % 7:
        continue
    k, v = line.split("\t")[:3:2]
    my_dict[k] = v

Edit: Not sure where I got extend from before. I meant update

Share:
117,441
thefragileomen
Author by

thefragileomen

Updated on January 21, 2021

Comments

  • thefragileomen
    thefragileomen over 3 years

    I currently have a CSV file which, when opened in Excel, has a total of 5 columns. Only columns A and C are of any significance to me and the data in the remaining columns is irrelevant.

    Starting on line 8 and then working in multiples of 7 (ie. lines 8, 15, 22, 29, 36 etc...), I am looking to create a dictionary with Python 2.7 with the information from these fields. The data in column A will be the key (a 6-digit integer) and the data in column C being the respective value for the key. I've tried to highlight this below but the formatting isn't the best:-

        A        B      C          D
    1                           CDCDCDCD  
    2                           VDDBDDB
    3
    4
    5
    6
    7  DDEFEEF                   FEFEFEFE
    8  123456         JONES
    9
    10
    11
    12
    13
    14
    15 293849         SMITH
    

    As per the above, I am looking to extract the value from A7 (DDEFEEF) as a key in my dictionary and "FEFEFEFE" being the respective data and then add another entry to my dictionary, jumping to line 15 with "2938495" being my key and "Smith" being the respective value.

    Any suggestions? The source file is a .txt file with entries being tab-delimited. Thanks

    Clarification:

    Just to clarify, so far, I have tried the below:-

    import csv
    
    mydict = {:}
    f = open("myfile", 'rt')
    reader = csv.reader(f)
        for row in reader:
            print row
    

    The above simply prints out all content though a row at a time. I did try "for row(7) in reader" but this returned an error. I then researched it and had a go at the below but it didn't work neither:

    import csv
    from itertools import islice
    
    entries = csv.reader(open("myfile", 'rb'))
    mydict = {'key' : 'value'}
    
    for i in xrange(6):
        mydict['i(0)] = 'I(2)    # integers representing columns
        range = islice(entries,6)
        for entry in range:
            mydict[entries(0) = entries(2)] # integers representing columns
    
  • thefragileomen
    thefragileomen over 12 years
    thanks for the reply. I've tried implementing the above and get a "ValueError" on the final line - "dictionary update sequence element #0 has length 1; 2 is required". Any suggestions?
  • kojiro
    kojiro over 12 years
    @thefragileomen edited for a more readable approach that doesn't require three levels of braces to work ;)
  • BLT
    BLT almost 2 years
    As of pandas version 0.21.0, pandas.DataFrame.from_csv() has been deprecated. pandas.read_csv() can be used instead.