Combining 2 .csv files by common column

12,900

Solution 1

sort -t , -k index1 file1 > sorted1
sort -t , -k index2 file2 > sorted2
join -t , -1 index1 -2 index2 -a 1 -a 2 sorted1 sorted2

Solution 2

This is the classical "relational join" problem.

You have several algorithms.

  • Nested Loops. You read from one file to pick a "master" record. You read the entire other file locating all "detail" records that match the master. This is a bad idea.

  • Sort-Merge. You sort each file into a temporary copy based on the common key. You then merge both files by reading from the master and then reading all matching rows from the detail and writing the merged records.

  • Lookup. You read one of the files entirely into a dictionary in memory, indexed by the key field. This can be tricky for the detail file, where you'll have multiple children per key. Then you read the other file and lookup the matching records in the dictionary.

Of these, sort-merge is often the fastest. This is done entirely using the unix sort command.

Lookup Implementation

import csv
import collections

index = collections.defaultdict(list)

file1= open( "someFile", "rb" )
rdr= csv.DictReader( file1 )
for row in rdr:
    index[row['MPID']].append( row )
file1.close()

file2= open( "anotherFile", "rb" )
rdr= csv.DictReader( file2 )
for row in rdr:
    print row, index[row['MPID']]
file2.close()

Solution 3

You'll need to look at the join command in the shell. You will also need to sort the data, and probably lose the first lines. The whole process will fall flat if any of the data contains commas. Or you will need to process the data with a CSV-sensitive process that introduces a different field separator (perhaps control-A) that you can use to split fields unambiguously.

The alternative, using Python, reads the two files into a pair of dictionaries (keyed on the common column(s)) and then use a loop to cover all the elements in the smaller of the two dictionaries, looking for matching values in the other. (This is basic nested loop query processing.)

Share:
12,900
Admin
Author by

Admin

Updated on July 28, 2022

Comments

  • Admin
    Admin over 1 year

    I have two .csv files where the first line in file 1 is:

    MPID,Title,Description,Model,Category ID,Category Description,Subcategory ID,Subcategory Description,Manufacturer ID,Manufacturer Description,URL,Manufacturer (Brand) URL,Image URL,AR Price,Price,Ship Price,Stock,Condition
    

    The first line from file 2:

    Regular Price,Sale Price,Manufacturer Name,Model Number,Retailer Category,Buy URL,Product Name,Availability,Shipping Cost,Condition,MPID,Image URL,UPC,Description
    

    and then rest of every file is filled with info.

    As you can see, both files have a common field called MPID (file 1: col 1, file 2: col 9, where the first col is col 1).

    I would like to create a new file which will combine these two files by looking at this column (as in: if there is an MPID that is in both files, then in the new file this MPID will appear with both its row from file 1 and its row from file 2). IF one MPID appears only in one file then it should also go into this combined file.

    The files are not sorted in any way.

    How do I do this on a debian machine with either a shell script or python?

    Thanks.

    EDIT: Both files dont have commas other than the ones separating the fields.

  • nullException
    nullException over 14 years
    join works great; but the input files must be sorted on the key. also it can't read arbitrary csv files. especifically, a comma within a quoted field would shift all fieldnums for that record
  • nullException
    nullException over 14 years
    beware of quoted commas! neither sort nor join obeys any quotation
  • Jonathan Leffler
    Jonathan Leffler over 14 years
    @Javier: agreed - which is why I updated my answer to say as much, even without seeing your comment (which was probably in preparation at the same time I was editing).
  • uman
    uman over 13 years
    Since Python 2.2, I believe, DictReader doesn't implement getitem. This was done for speed reasons. So the code: <code> index[rdr['MPID']].append( row ) </code> will fail with a AttributeError: DictReader instance has no attribute 'getitem'
  • user1066101
    user1066101 over 13 years
    @uman: The results of a DictReader are a first-class dict object. With all of the dict methods intact.
  • uman
    uman over 13 years
    @S.Lott: perhaps your code was incorrect. I'm VERY rusty with Python, but trying MPID's example with your code results in the AttributeError. See: pastebin.com/VJgSfA3u P.S. Anyone know how to embed formatting in comments?
  • user1066101
    user1066101 over 13 years
    @uman: Rather than do too much in comments, just open a fresh, new question with the code you don't like.
  • uman
    uman about 13 years
    No-- doing so would disassociate the response from the question that prompted it. I haven't downvoted your response because the rest of it is spot on. But your code's wrong.
  • user1066101
    user1066101 about 13 years
    @uman: What? You have a problem with the code, post the code with the question. A complete "association" with previous questions is trivial. A question can reference another question. What's the problem?
  • tripleee
    tripleee over 4 years
    To somewhat expand on Javier's comment, CSV format allows columns to contain literal commas by double quoting them. The line one,"two,more two","three, ""more, still more three""" contains three CSV columns which exhibit quoted commas and even quoted commas inside escaped quotes in the third column. (Different CSV dialects with different escaping mechanisms exist, too.) The only proper solution to this is to use an actual CSV parser, such as Python's csv module.
  • tripleee
    tripleee over 4 years
    This is severe overkill unless your data is too big for join or Awk, and/or you need to perform multiple joins over the same data.
  • tripleee
    tripleee over 4 years
    This module still exists in PyPI but seems unmaintained (no activity since 2012; homepage link doesn't work).
  • tripleee
    tripleee over 4 years
    I found the near-abandoned Github repo which points to bubbles.databrewery.org as its heir and replacement.