Compare two files with first column and remove duplicate row from 2nd file in shell script
Solution 1
You can use awk
for this:
awk 'FNR==NR{a[$1];next};!($1 in a)' file1 file2 > file3
Explanation:
FNR == NR
: This test is true when the number of records is equal to the number of records in the file. This is only true for the first file, for the second fileNR
will be equal to the number of lines of file1 +FNR
.a[$1]
: Create an array element index of the first field of file1.next
: skip to the next record so no more processing is done on file1.!($1 in a)
: See if the first field ($1) is present in the array, ie in file1, and print the whole line (to file3).
Based on one of the examples from the #awk wiki.
Solution 2
export LC_ALL=C
comm -13 <(sort f1) <(sort f2)
Would report the lines that are only in f2
.
export LC_ALL=C
join -v2 <(sort f1) <(sort f2)
Would report the lines of f2
whose first field is not found as the first field in any line of f1
.
(you need a shell with support for process substitution like ksh93
, zsh
or bash
).
Solution 3
Method 1# Bash
#!/usr/bin/env bash
file1=$1
file2=$2
[[ $# -ne 2 ]] && { echo -e "\n\tUsage: \t$0 file1 file2\n"; exit 1; }
while read line
do
if ! grep -q "${line%% .*}" $file1; then
echo "${line}"
fi
done < $file2
Method 2# Only Grep
grep -v "$(< file1)" file2
grep is Working, but not guaranty
Solution 4
Just for fun here's a solution in Perl:
#!/usr/bin/perl
# create names lookup table from first file
my %names;
while (<>) {
(my $col1)= split / /, $_;
$names{$col1} = 1;
last if eof;
}
# scan second file
while (<>) {
print if /^(\S+).*/ && not $names{$1};
}
Example
$ ./showdiffs.pl file1 file2
0BDB FC600_R5_TP FX.B 33554640 6044364 18 6033105 18 6044364 0
0BDC FC600_R5_TP FX.B 33554640 6613536 20 6481974 19 6613536 0
0BDD FC600_R5_TP FX.B 33554640 4435848 13 4057170 12 4435848 0
0BDE FC600_R5_TP FX.B 33554640 6620868 20 6249518 19 6620868 0
Details
The Perl solution above is comprised of 2 loops. The first loop reads all the lines in from file1
and creates a hash, %names
where each column that we identify is added.
$names{11AA} = 1;
The 2nd while
loop then runs over the 2nd file, file2
, and each line's column 1 is identified using the regular expression:
^(\S+).*
The above says from the beginning of the line, match everything that isn't a space, and save it in the temporary variable $1
. It's saved by wrapping parens around it. The .*
says to match everything else on the line.
The next bit of that lines says to look up the column 1 bit that we just saved in $1
in the %names
hash:
$names{$1}
If it's present there, then we don't want to print it. If it isn't there, then print it.
Solution 5
Lets get it as
File #1: file1.txt
File #2: file2.txt
Then run following on terminal
fgrep -vf test1.txt test2.txt > output.txt
output.txt will contain the desired results.
Explanation:
fgrep : print lines matching a pattern (from manual page)
-v : get only non-matching rows
-f : obtain PATTERN from FILE (from manual page)
Related videos on Youtube
don_crissti
Updated on September 18, 2022Comments
-
don_crissti over 1 year
I will ask my question with an example. I have 2 files:
File #1:
118D FC300_R5_TP FX.B 32775 2112 6 2038 6 2112 0 118E FC300_R5_TP FX.B 32775 2136 7 2065 6 2136 0 118F FC300_R5_TP FX.B 32775 2124 6 2064 6 2124 0 1190 FC300_R5_TP FX.B 819210 814632 99 814609 99 814632 0 1191 FC300_R5_TP FX.B 819210 104100 13 103714 13 104100 0 1192 FC300_R5_TP FX.B 1638420 1609476 98 1609402 98 1609476 0 1196 FC300_R5_TP FX.B 1638420 1638432 100 1638379 100 1638432 0 119A FC300_R5_TP FX.B 3276840 3271776 100 3271698 100 3271776 0 119E FC300_R5_TP FX.B 3276840 3264120 100 3264034 100 3264120 0 11A2 FC300_R5_TP FX.B 3276840 2328648 71 2328546 71 2328648 0 11A6 FC300_R5_TP FX.B 3276840 2328444 71 2328355 71 2328444 0 11AA FC300_R5_TP FX.B 3276840 2328528 71 2328403 71 2328528 0 11AE FC300_R5_TP FX.B 3276840 2328648 71 2328468 71 2328648 0 11B2 FC300_R5_TP FX.B 3276840 2130000 65 2129766 65 2130000 0 173A FC300_R5_TP FX.B 6553680 6478572 99 6477747 99 6478572 0
File #2:
11AA FC300_R5_TP FX.B 3276840 2328528 71 2328403 71 2328528 0 11AE FC300_R5_TP FX.B 3276840 2328648 71 2328468 71 2328648 0 11B2 FC300_R5_TP FX.B 3276840 2130000 65 2129766 65 2130000 0 173A FC300_R5_TP FX.B 6553680 6478572 99 6477747 99 6478572 0 0BDB FC600_R5_TP FX.B 33554640 6044364 18 6033105 18 6044364 0 0BDC FC600_R5_TP FX.B 33554640 6613536 20 6481974 19 6613536 0 0BDD FC600_R5_TP FX.B 33554640 4435848 13 4057170 12 4435848 0 0BDE FC600_R5_TP FX.B 33554640 6620868 20 6249518 19 6620868 0
Desired Output
File #3:
0BDB FC600_R5_TP FX.B 33554640 6044364 18 6033105 18 6044364 0 0BDC FC600_R5_TP FX.B 33554640 6613536 20 6481974 19 6613536 0 0BDD FC600_R5_TP FX.B 33554640 4435848 13 4057170 12 4435848 0 0BDE FC600_R5_TP FX.B 33554640 6620868 20 6249518 19 6620868 0
I'd like to compare file 1 and file 2 using their first columns and remove the entire line or row from file 2 where they match in file 1. I'd also like to save the results to a 3rd file, file #3.
-
ekoeppen over 10 yearsPerfect Answer!!!
-
Adaephon almost 10 yearsThis only works if whole lines are identical, but the questioner explicitly asked for comparison only on the first column.