Remove entries from one CSV file that are already present in another

Solution 1

I'm assuming your CSV files look something like this:

File1

123123,,
222333,,

File2

111222,Jones,Sally
111333,Johnson,Roger
123123,Doe,John
444555,Richardson,George
222333,Smith,Jane
223456,Alexander,Philip

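You could try the join command. It expects both inputs to be sorted on the join field, hence the sort calls (the <(...) process substitution needs a shell such as bash):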

# join -t, -v 2 <(sort file1) <(sort file2)
111222,Jones,Sally
111333,Johnson,Roger
223456,Alexander,Philip
444555,Richardson,George

More information about the command is in its manual page (man join):

join [OPTION]... FILE1 FILE2

-t CHAR
    use CHAR as input and output field separator 
-v FILENUM
    like -a FILENUM, but suppress joined output lines 
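
To illustrate why join wants sorted input, here is a minimal Python sketch of the same idea: walk both sorted inputs in step and keep the rows from the second one that have no partner in the first. This is not how join is implemented; the function name anti_join_sorted and the naive split on commas (no quoted fields) are assumptions made for the example.

```python
def anti_join_sorted(left_keys, right_rows, key=lambda row: row[0]):
    """Yield rows of right_rows whose key is absent from left_keys.

    Both inputs must already be sorted by key, as join expects.
    """
    left = iter(left_keys)
    current = next(left, None)
    for row in right_rows:
        k = key(row)
        # Advance the left side until it catches up with the right key.
        while current is not None and current < k:
            current = next(left, None)
        # No matching key on the left: this row is "unpairable", so keep it.
        if current != k:
            yield row


# Sample data copied from File1 and File2 above.
file1 = ["123123,,", "222333,,"]
file2 = [
    "111222,Jones,Sally",
    "111333,Johnson,Roger",
    "123123,Doe,John",
    "444555,Richardson,George",
    "222333,Smith,Jane",
    "223456,Alexander,Philip",
]

# Assumption: plain comma-separated fields without quoting.
left_keys = sorted(line.split(",")[0] for line in file1)
right_rows = sorted((line.split(",") for line in file2), key=lambda r: r[0])

result = [",".join(row) for row in anti_join_sorted(left_keys, right_rows)]
for line in result:
    print(line)
```

Because each input is read only once and in order, the sorting step is what makes the single pass possible.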

Solution 2

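Try this. Setting FS to a comma in a BEGIN block makes awk split fields on commas. While the first file is being read (FNR==NR), each ID in column 1 is stored as a key of the array a; for the second file, only the lines whose first field is not in a are printed: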

awk 'BEGIN{FS=","};FNR==NR{a[$1];next};!($1 in a)' file1 file2 > file3

Solution 3

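You can also try the following Python 2 solution:

#!/usr/bin/env python2
import csv

# Collect the IDs (first column) listed in file1; a set gives fast lookups.
with open('file1') as f1:
    ids_to_drop = set(row[0] for row in csv.reader(f1) if row)

# Print every file2 row whose ID is not in file1, keeping the comma separator.
with open('file2') as f2:
    for row in csv.reader(f2):
        if row and row[0] not in ids_to_drop:
            print ','.join(row)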
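
For anyone on Python 3 who wants to pick the key column, here is a rough sketch along the same lines. The function name remove_known_rows and the key_col parameter are made up for illustration, and it assumes the key sits in the same column in both files. Only the keys of the first file are held in memory; the second file is streamed row by row, which should help with large inputs.

```python
import csv
import io


def remove_known_rows(keys_file, data_file, out_file, key_col=0):
    """Copy rows from data_file to out_file unless their key is in keys_file.

    Returns the number of rows written.
    """
    # Only the keys from the first file are kept in memory.
    known = {row[key_col] for row in csv.reader(keys_file) if len(row) > key_col}
    writer = csv.writer(out_file, lineterminator="\n")
    kept = 0
    for row in csv.reader(data_file):
        # Rows too short to contain the key column are skipped.
        if len(row) > key_col and row[key_col] not in known:
            writer.writerow(row)
            kept += 1
    return kept


# In-memory stand-ins for file1, file2 and file3, using the sample data above.
file1 = io.StringIO("123123,,\n222333,,\n")
file2 = io.StringIO(
    "111222,Jones,Sally\n"
    "111333,Johnson,Roger\n"
    "123123,Doe,John\n"
    "444555,Richardson,George\n"
    "222333,Smith,Jane\n"
    "223456,Alexander,Philip\n"
)
file3 = io.StringIO()

kept = remove_known_rows(file1, file2, file3)
print(file3.getvalue(), end="")
```

With real files, open them with newline='' as the csv module documentation recommends and pass the file objects in place of the StringIO stand-ins.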

Author: pgrason

Updated on September 18, 2022

Comments

  • pgrason
    pgrason almost 2 years

    I have two files: 'file1' has employee ID numbers, 'file2' has the complete database of the employees. Here is what they look like:

    • file1
      123123
      222333
      
    • file2
      111222 Jones Sally
      111333 Johnson Roger
      123123 Doe John
      444555 Richardson George
      222333 Smith Jane
      223456 Alexander Philip
      

    I want to compare the two files and eliminate the entries from file2 that have ID numbers in file1.

    I found this awk command which works perfectly:

    awk 'FNR==NR{a[$1];next};!($1 in a)' file1 file2 > file3
    

    The result:

    • file3
      111222 Jones Sally
      111333 Johnson Roger
      444555 Richardson George
      223456 Alexander Philip
      

    So this works as expected.

    My problem is that the files are actually simplified .csv files, so I must use a comma as the separator rather than a space. I have tried everything I can think of to make this work (e.g. -F, , -F',' , -F"," everywhere in the command) without success.

    How do I get this to work with .csv files?

    By the way, I am on a MacBook Pro running OS X Lion.

    • Admin
      Admin over 9 years
      Did you have a space after -F?
  • peterh
    peterh over 9 years
    The idea is okay, but a code snippet-only answer is not.
  • pgrason
    pgrason over 9 years
    "join" works, thanks. However, sometimes I want to use a different field in the files. So maybe the "awk" is better.
  • pgrason
    pgrason over 9 years
    This works the way I want. I can choose which field to use as the key. Thanks. Will there be any problems with very large file1 & file2?
  • pgrason
    pgrason over 9 years
    I just tried this command on two large .csv files and it worked just as I wanted. Thanks!
  • devnull
    devnull over 9 years
    @pgrason Define 'fields'. If there is a common field in both files, join should always work.
  • Matthias B
    Matthias B almost 6 years
    @pgrason Is this the way you solved your problem? Then please accept the answer, so others know what worked for you.