Compare two files by their first column and remove duplicate rows from the 2nd file in a shell script


Solution 1

You can use awk for this:

awk 'FNR==NR{a[$1];next};!($1 in a)' file1 file2 > file3

Explanation:

  • FNR == NR: FNR is the record (line) number within the current input file, while NR is the record number across all input files, so this test is only true while awk is reading the first file; for the second file NR will be equal to the number of lines of file1 plus FNR (see the quick illustration after this list).

  • a[$1]: Create an array element indexed by the first field of each line of file1 (only the key matters, the value is never used).

  • next: skip to the next record so no more processing is done on file1.

  • !($1 in a): For the lines of file2, test whether the first field ($1) is present in the array, i.e. appeared in file1; if it is not, the default action prints the whole line (redirected to file3).

Based on one of the examples from the #awk wiki.
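
To see why FNR==NR singles out the first file, you can print both counters yourself (a quick illustration only, not part of the solution):

awk '{print FILENAME, NR, FNR}' file1 file2

While file1 is being read, NR and FNR advance together; once awk moves on to file2, FNR restarts at 1 while NR keeps counting, so the FNR==NR test stops being true.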

Solution 2

export LC_ALL=C
comm -13 <(sort f1) <(sort f2)

Would report the lines that are only in f2 (comparing whole lines, so the duplicate rows must be identical in every column, not just the first).

export LC_ALL=C
join -v2 <(sort f1) <(sort f2)

Would report the lines of f2 whose first field is not found as the first field in any line of f1.

(you need a shell with support for process substitution like ksh93, zsh or bash).
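
If you want the result in a third file, as in the question, redirect the join variant (the one that compares only the first field); note that join re-joins the fields with single spaces, so the original column alignment is not preserved:

export LC_ALL=C
join -v2 <(sort file1) <(sort file2) > file3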

Solution 3

Method 1# Bash

#!/usr/bin/env bash
file1=$1
file2=$2

[[ $# -ne 2 ]] && { echo -e "\n\tUsage: \t$0 file1 file2\n"; exit 1; }

while IFS= read -r line
do
        # take everything up to the first space (the first column) and
        # look for it anchored at the start of a line in file1
        if ! grep -q "^${line%% *}[[:space:]]" "$file1"; then
                echo "${line}"
        fi
done < "$file2"
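
Saved as, say, filter.sh (a name chosen here just for illustration) and made executable, the script would be run like this to produce the third file from the question:

chmod +x filter.sh
./filter.sh file1 file2 > file3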

Method 2# Only Grep

grep -v "$(< file1)" file2

This grep one-liner may work (each line of file1 becomes a separate pattern), but it is not guaranteed: the lines are treated as regular expressions and matched anywhere in the line, rather than compared only on the first column.
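
If you want a grep-based approach that really compares only the first column, one possible sketch (assuming GNU grep, and that the first column contains no regex metacharacters and is followed by a space, as in the sample data) is to turn column 1 of file1 into anchored patterns:

cut -d' ' -f1 file1 | sed 's/.*/^& /' | grep -vf - file2 > file3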

Solution 4

Just for fun here's a solution in Perl:

#!/usr/bin/perl

# create names lookup table from first file
my %names;
while (<>) {
    (my $col1)= split / /, $_;
    $names{$col1} = 1;
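    # eof without parentheses is true at the end of each file read via <>, so stop after the first file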
    last if eof;
}

# scan second file
while (<>) {
    print if /^(\S+).*/ && not $names{$1};
}

Example

$ ./showdiffs.pl file1  file2
0BDB FC600_R5_TP  FX.B   33554640    6044364  18    6033105  18    6044364   0
0BDC FC600_R5_TP  FX.B   33554640    6613536  20    6481974  19    6613536   0
0BDD FC600_R5_TP  FX.B   33554640    4435848  13    4057170  12    4435848   0
0BDE FC600_R5_TP  FX.B   33554640    6620868  20    6249518  19    6620868   0

Details

The Perl solution above consists of two loops. The first loop reads all the lines of file1 and builds a hash, %names, where the first column of each line is added as a key.

$names{'11AA'} = 1;

The second while loop then runs over the second file, file2, and each line's first column is extracted using the regular expression:

^(\S+).*

The above says: from the beginning of the line, match everything that isn't whitespace and save it in the capture variable $1; it is captured by wrapping parentheses around it. The .* then matches everything else on the line.

The next bit of that line says to look up the column 1 value that we just saved in $1 in the %names hash:

$names{$1}

If it's present there, then we don't want to print it. If it isn't there, then print it.
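
To save the result to a third file, as asked in the question, just redirect the script's output:

./showdiffs.pl file1 file2 > file3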

Solution 5

Let's name the files as follows:

File #1: file1.txt

File #2: file2.txt

Then run the following in a terminal:

fgrep -vf file1.txt file2.txt > output.txt

output.txt will contain the desired results.

Explanation:

fgrep : print lines matching a pattern, treating the patterns as fixed strings (from the manual page)
-v : invert the match, i.e. select only the non-matching lines
-f : obtain the patterns from FILE, one per line (from the manual page)
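
In current GNU grep, direct invocation as fgrep is deprecated in favour of grep -F, so an equivalent command is:

grep -vF -f file1.txt file2.txt > output.txt

Bear in mind that this matches the lines of file1.txt as fixed strings anywhere in the lines of file2.txt, not only against the first column, so it removes a row of file2.txt whenever a whole line of file1.txt occurs somewhere in it.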

Comments

  • don_crissti over 1 year

    I will ask my question with an example. I have 2 files:

    File #1:

    118D FC300_R5_TP  FX.B      32775       2112   6       2038   6       2112   0
    118E FC300_R5_TP  FX.B      32775       2136   7       2065   6       2136   0
    118F FC300_R5_TP  FX.B      32775       2124   6       2064   6       2124   0
    1190 FC300_R5_TP  FX.B     819210     814632  99     814609  99     814632   0
    1191 FC300_R5_TP  FX.B     819210     104100  13     103714  13     104100   0
    1192 FC300_R5_TP  FX.B    1638420    1609476  98    1609402  98    1609476   0
    1196 FC300_R5_TP  FX.B    1638420    1638432 100    1638379 100    1638432   0
    119A FC300_R5_TP  FX.B    3276840    3271776 100    3271698 100    3271776   0
    119E FC300_R5_TP  FX.B    3276840    3264120 100    3264034 100    3264120   0
    11A2 FC300_R5_TP  FX.B    3276840    2328648  71    2328546  71    2328648   0
    11A6 FC300_R5_TP  FX.B    3276840    2328444  71    2328355  71    2328444   0
    11AA FC300_R5_TP  FX.B    3276840    2328528  71    2328403  71    2328528   0
    11AE FC300_R5_TP  FX.B    3276840    2328648  71    2328468  71    2328648   0
    11B2 FC300_R5_TP  FX.B    3276840    2130000  65    2129766  65    2130000   0
    173A FC300_R5_TP  FX.B    6553680    6478572  99    6477747  99    6478572   0
    

    File #2:

    11AA FC300_R5_TP  FX.B    3276840    2328528  71    2328403  71    2328528   0
    11AE FC300_R5_TP  FX.B    3276840    2328648  71    2328468  71    2328648   0
    11B2 FC300_R5_TP  FX.B    3276840    2130000  65    2129766  65    2130000   0
    173A FC300_R5_TP  FX.B    6553680    6478572  99    6477747  99    6478572   0
    0BDB FC600_R5_TP  FX.B   33554640    6044364  18    6033105  18    6044364   0
    0BDC FC600_R5_TP  FX.B   33554640    6613536  20    6481974  19    6613536   0
    0BDD FC600_R5_TP  FX.B   33554640    4435848  13    4057170  12    4435848   0
    0BDE FC600_R5_TP  FX.B   33554640    6620868  20    6249518  19    6620868   0
    

    Desired Output

    File #3:

    0BDB FC600_R5_TP  FX.B   33554640    6044364  18    6033105  18    6044364   0
    0BDC FC600_R5_TP  FX.B   33554640    6613536  20    6481974  19    6613536   0
    0BDD FC600_R5_TP  FX.B   33554640    4435848  13    4057170  12    4435848   0
    0BDE FC600_R5_TP  FX.B   33554640    6620868  20    6249518  19    6620868   0
    

    I'd like to compare file 1 and file 2 using their first columns and remove the entire line or row from file 2 where they match in file 1. I'd also like to save the results to a 3rd file, file #3.

  • ekoeppen over 10 years
    Perfect Answer!!!
  • Adaephon almost 10 years
    This only works if whole lines are identical, but the questioner explicitly asked for comparison only on the first column.