Compare two files: lines present in one, not in the other, by one column comparison

Solution 1

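join requires that both files be sorted on the join field (as they are in the arguments to join in this example), so if you need to maintain the original sequence of the output, you would need a different approach. Note that it doesn't try to keep the width of the original field spacing. The -o option keeps the fields in their original order; without it, join prints the join field first.

join -1 2 -2 2 -v 1 -o 1.1,1.2,1.3 <(sort -k2,2 file1) <(sort -k2,2 file2)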

output

21 12342 2
21 12349 7
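
If the original order and spacing of file1 matter, one option that avoids sorting is to collect the column 2 values of file2 in a set and filter file1 in a single pass. The following is only a minimal sketch: the helper name `lines_missing_by_column` is made up for illustration, and the two lists hold the sample data from the question rather than being read from disk.

```python
def lines_missing_by_column(lines1, lines2, col=1):
    """Return the lines of lines1 whose field `col` (0-based) is absent from lines2."""
    # Collect the comparison keys from the second file; blank or short lines are skipped.
    keys = {f[col] for f in (line.split() for line in lines2) if len(f) > col}
    # Keep the first file's lines in their original order, with spacing untouched.
    return [
        line for line in lines1
        if len(line.split()) > col and line.split()[col] not in keys
    ]


# Sample data copied from the question (hypothetical stand-ins for file1 and file2).
file1 = ["21  12340   3", "21  12341   7", "21  12342   2",
         "21  12343   89", "21  12349   7"]
file2 = ["21  12340   55", "21  12341   7", "21  12343   89", "21  12344   7",
         "21  12346   88", "21  12347   3", "21  12348   37"]

for line in lines_missing_by_column(file1, file2):
    print(line)
```

Because only a set lookup is involved, neither input has to be sorted, and each surviving line is printed exactly as it appeared in file1.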

Solution 2

One awk solution:

awk '
    FNR == NR {
        data[ $2 ] = 1;
        next;
    }
    FNR < NR {
        if ( ! ($2 in data) ) {
            print $0;
        }
    }
' file2 file1

Result:

21  12342   2
21  12349   7

Solution 3

Using Python from the bash shell:

paddy$ python -c 'import sys
with open(sys.argv[2]) as f: file2col2 = {line.split()[1] for line in f}
with open(sys.argv[1]) as f: print("".join(line for line in f 
                                           if line.split()[1] not in file2col2))
' file1.tmp file2.tmp
21  12342   2
21  12349   7

paddy$ 

Solution 4

Using egrep and awk:

egrep -v -f <(awk '{printf "^%s[ ]+%s[ ]+\n", $1, $2}' file2) file1

The awk bit inside <() generates patterns based on the contents of file2. The egrep uses these patterns to match lines in file1, with -v inverting the matching, printing only the lines that don't match.
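
To make the generated patterns visible, here is a rough Python sketch of the same idea: build one anchored pattern per line of file2, then keep the lines of file1 that match none of them. The function names `build_patterns` and `invert_match` are invented for this illustration, and `re.escape` is an addition (the awk command inserts the fields unescaped). Note that the trailing `[ ]+` in each pattern prevents a key from matching as a prefix of a longer key.

```python
import re


def build_patterns(lines2):
    """Build one anchored pattern per line, mirroring the awk printf format."""
    patterns = []
    for line in lines2:
        fields = line.split()
        if len(fields) >= 2:
            # Columns 1 and 2, each followed by at least one space.
            patterns.append("^%s[ ]+%s[ ]+" % (re.escape(fields[0]), re.escape(fields[1])))
    return patterns


def invert_match(lines1, patterns):
    """Return the lines that match none of the patterns (like grep -v -f)."""
    compiled = [re.compile(p) for p in patterns]
    return [line for line in lines1 if not any(c.search(line) for c in compiled)]


# Small subset of the question's data, used only for illustration.
sample2 = ["21  12340   55", "21  12341   7"]
sample1 = ["21  12340   3", "21  12342   2"]

patterns = build_patterns(sample2)
for p in patterns:
    print(p)
for line in invert_match(sample1, patterns):
    print(line)
```

Unlike the other solutions, this approach compares columns 1 and 2 together, which gives the same result here only because column 1 is identical in both files.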

Admin
Author by

Admin

Updated on September 18, 2022

Comments

  • Admin
    Admin over 1 year

    I need to compare 2 files. Column 1 is the same in both files. Column 2 is what I want to compare: I want all lines in file 1 that are not in file 2 when comparing column 2. Column 3 is different in both files, even for lines where column 1 and 2 are identical. I cannot remove column 3, because as an output I want the lines from file 1 including this column.

    Here is an example:

    File 1

    21  12340   3
    21  12341   7
    21  12342   2
    21  12343   89
    21  12349   7
    

    File 2

    21  12340   55
    21  12341   7
    21  12343   89
    21  12344   7
    21  12346   88
    21  12347   3
    21  12348   37
    

    My output would be:

    21  12342   2
    21  12349   7
    
  • Peter.O
    Peter.O over 11 years
    Good +1... btw, you don't need the FNR < NR, because of the preceding next... also, you don't need the array value assignment. Defining the array's index is enough: data[ $2 ]; and it runs notably faster without it.