Compare two files: lines present in one, not in the other, by one column comparison
Solution 1
join requires that both files be sorted on the join field (as they are in the example's arguments to join), so if you need to maintain the original order of the output, a different approach is needed. Note that it does not try to preserve the original field spacing.
join -1 2 -2 2 -v 1 <(sort file1) <(sort file2)
output
21 12342 2
21 12349 7
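To illustrate why sorted input matters here, the following is a rough Python sketch of a single-pass anti-join over two key-sorted lists. It is not how join is actually implemented; the function name anti_join_sorted and the plain string comparison of keys are assumptions made only for this example.

```python
def anti_join_sorted(left, right, column=1):
    """Return rows of `left` whose key has no match in `right`.

    Assumes both lists are already sorted by the key column,
    compared as plain strings (an assumption of this sketch).
    """
    result = []
    j = 0  # position in `right`; it only ever moves forward
    for row in left:
        key = row.split()[column]
        # Skip rows of `right` whose key sorts before the current key.
        while j < len(right) and right[j].split()[column] < key:
            j += 1
        # No row left, or the next row has a different key: unpaired line.
        if j == len(right) or right[j].split()[column] != key:
            result.append(row)
    return result


file1 = ["21 12340 3", "21 12341 7", "21 12342 2", "21 12343 89", "21 12349 7"]
file2 = ["21 12340 55", "21 12341 7", "21 12343 89", "21 12344 7",
         "21 12346 88", "21 12347 3", "21 12348 37"]

for line in anti_join_sorted(file1, file2):
    print(line)
```

Because the scan through the second list only moves forward, unsorted input can make it skip past a key and report a line that does have a match, which is the reason for the sort calls in the command above.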
Solution 2
One awk solution:
awk '
    FNR == NR {
        data[ $2 ] = 1;
        next;
    }
    FNR < NR {
        if ( ! ($2 in data) ) {
            print $0;
        }
    }
' file2 file1
Result:
21 12342 2
21 12349 7
Solution 3
Using Python from the bash shell:
paddy$ python -c 'import sys
with open(sys.argv[2]) as f: file2col2 = {line.split()[1] for line in f}
with open(sys.argv[1]) as f: print("".join(line for line in f
if line.split()[1] not in file2col2))
' file1.tmp file2.tmp
21 12342 2
21 12349 7
paddy$
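The same idea may be easier to read as a small function. The sketch below is one possible restatement, not the answer's own code: the name lines_missing_from and the choice to skip lines with too few fields (which would raise an IndexError in the one-liner) are assumptions added for illustration. The sample data is taken from the question.

```python
def lines_missing_from(file1_lines, file2_lines, column=1):
    """Return lines of file1 whose `column` field is absent from file2."""
    # Collect the comparison field from every file2 line that has it.
    seen = {
        fields[column]
        for fields in (line.split() for line in file2_lines)
        if len(fields) > column
    }
    # Keep file1 lines, in their original order, whose field was never seen.
    # Lines too short to have the field are skipped (assumption of this sketch).
    return [
        line
        for line in file1_lines
        if len(line.split()) > column and line.split()[column] not in seen
    ]


file1 = ["21 12340 3", "21 12341 7", "21 12342 2", "21 12343 89", "21 12349 7"]
file2 = ["21 12340 55", "21 12341 7", "21 12343 89", "21 12344 7",
         "21 12346 88", "21 12347 3", "21 12348 37"]

for line in lines_missing_from(file1, file2):
    print(line)
```

Since the lookup uses a set and file1 is read in order, this approach needs no sorting and keeps the original line order.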
Solution 4
Using egrep and awk:
egrep -v -f <(awk '{printf "^%s[ ]+%s[ ]+\n", $1, $2}' file2) file1
The awk bit inside <() generates patterns based on the contents of file2. The egrep uses these patterns to match lines in file1, with -v inverting the matching, printing only the lines that don't match.
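As a rough illustration of the two steps, here is a Python sketch that builds one anchored pattern per line of file2 and then keeps the lines of file1 that match none of them. The function names build_patterns and invert_match are hypothetical, and the use of re.escape is an addition of this sketch; the awk command above inserts the fields into the pattern unescaped.

```python
import re


def build_patterns(file2_lines):
    """Build one anchored regex per file2 line from its first two fields."""
    patterns = []
    for line in file2_lines:
        fields = line.split()
        if len(fields) >= 2:
            # Same shape as the awk printf format: ^col1[ ]+col2[ ]+
            # re.escape is an extra safeguard not present in the original.
            patterns.append(re.compile(
                "^" + re.escape(fields[0]) + "[ ]+" + re.escape(fields[1]) + "[ ]+"
            ))
    return patterns


def invert_match(file1_lines, patterns):
    """Keep only the lines that match none of the patterns (like -v)."""
    return [line for line in file1_lines
            if not any(p.search(line) for p in patterns)]


file1 = ["21 12340 3", "21 12341 7", "21 12342 2", "21 12343 89", "21 12349 7"]
file2 = ["21 12340 55", "21 12341 7", "21 12343 89", "21 12344 7",
         "21 12346 88", "21 12347 3", "21 12348 37"]

for line in invert_match(file1, build_patterns(file2)):
    print(line)
```

Note that the trailing [ ]+ in each pattern means a line must have something after column 2 to match, and that every line of file1 is tested against every pattern, so this approach is likely to scale less well than the lookup-based solutions when file2 is large.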
Admin
Updated on September 18, 2022
Comments
-
Admin over 1 year
I need to compare 2 files. Column 1 is the same in both files. Column 2 is what I want to compare: I want all lines in file 1 whose column 2 value does not appear in column 2 of file 2. Column 3 can differ between the files, even for lines where columns 1 and 2 are identical. I cannot remove column 3, because the output should be the complete lines from file 1, including this column.
Here is an example:
File 1
21 12340 3
21 12341 7
21 12342 2
21 12343 89
21 12349 7
File 2
21 12340 55
21 12341 7
21 12343 89
21 12344 7
21 12346 88
21 12347 3
21 12348 37
My output would be:
21 12342 2
21 12349 7
-
Peter.O over 11 years
Good +1... btw, you don't need the FNR < NR, because of the preceding next... also, you don't need the array value assignment. Defining the array's index is enough: data[ $2 ]; and it runs notably faster without it.