extracting unique values between 2 sets/files


Solution 1

$ awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1 file2
6
7

Explanation of how the code works:

  • If we're working on file1, track each line of text we see.
  • If we're working on file2 and have not seen the line of text before, then print it.

Explanation of details:

  • FNR is the current file's record number
  • NR is the current overall record number from all input files
  • FNR==NR is true only when we are reading file1
  • $0 is the current line of text
  • a[$0] is an entry in the associative array (hash) a, keyed by the current line of text
  • a[$0]++ tracks that we've seen the current line of text
  • !($0 in a) is true only when we have not seen the line text
  • Print the line of text if the above pattern returns true; printing is the default awk action when no explicit action is given
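
For reference, this is how the run above can be reproduced with the sample data from the question (printf simply writes each argument on its own line):

$ printf '%s\n' 1 2 3 4 5 > file1
$ printf '%s\n' 6 7 1 2 3 4 > file2
$ awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1 file2
6
7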

Solution 2

Using some lesser-known utilities:

sort file1 > file1.sorted
sort file2 > file2.sorted
comm -1 -3 file1.sorted file2.sorted

Note that comm matches up repeated lines, so it can output duplicates: if a line occurs three times in file2 but only twice in file1, the extra occurrence is still printed. If this is not what you want, pipe the output from sort through uniq before writing it to a file:

sort file1 | uniq > file1.sorted
sort file2 | uniq > file2.sorted
comm -1 -3 file1.sorted file2.sorted
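
To make the duplicate behavior of the first variant concrete, here is a small demonstration with hypothetical two-line and four-line inputs:

$ printf '%s\n' 1 1 > file1.sorted         # "1" twice
$ printf '%s\n' 1 1 1 2 > file2.sorted     # "1" three times, plus "2"
$ comm -1 -3 file1.sorted file2.sorted
1
2

comm pairs up matching occurrences, so the third "1" in file2.sorted counts as unique to file2 and is printed; the uniq variant above would suppress it.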

There are lots of utilities in the GNU coreutils package that allow for all sorts of text manipulations.

Solution 3

I was wondering which of the following solutions was the "fastest" for "larger" files:

awk 'FNR==NR{a[$0]++}FNR!=NR && !a[$0]{print}' file1 file2 # awk1 by SiegeX
awk 'FNR==NR{a[$0]++;next}!($0 in a)' file1 file2          # awk2 by ghostdog74
comm -13 <(sort file1) <(sort file2)
join -v 2 <(sort file1) <(sort file2)
grep -v -F -x -f file1 file2

Results of my benchmarks in short:

  • Do not use grep -Fxf, it's much slower (2-4 times in my tests).
  • comm is slightly faster than join.
  • If file1 and file2 are already sorted, comm and join are much faster than awk1 + awk2. (The awk solutions, of course, do not require sorted input.)
  • awk1 + awk2 supposedly use more RAM and less CPU. Real (wall-clock) run times are lower for comm, probably because the two sorts in the process substitutions run in parallel as separate processes; total CPU times are lower for awk1 + awk2.

For the sake of brevity I omit full details. However, I assume that anyone interested can contact me or just repeat the tests. Roughly, the setup was

# Debian Squeeze, Bash 4.1.5, LC_ALL=C, slow 4 core CPU
$ wc file1 file2
  321599   321599  8098710 file1
  321603   321603  8098794 file2
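
A minimal sketch of how such a run could be repeated (hypothetical test data of roughly the benchmarked line counts, not the author's actual files; output is redirected to /dev/null so printing does not dominate the timing):

$ export LC_ALL=C
$ seq 1 321599 | shuf > file1
$ seq 1 321603 | shuf > file2
$ for i in 1 2 3; do time awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2 > /dev/null; done
$ for i in 1 2 3; do time comm -13 <(sort file1) <(sort file2) > /dev/null; done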

Typical results of the fastest runs:

awk2: real 0m1.145s  user 0m1.088s  sys 0m0.056s  user+sys 1.144
awk1: real 0m1.369s  user 0m1.324s  sys 0m0.044s  user+sys 1.368
comm: real 0m0.980s  user 0m1.608s  sys 0m0.184s  user+sys 1.792
join: real 0m1.080s  user 0m1.756s  sys 0m0.140s  user+sys 1.896
grep: real 0m4.005s  user 0m3.844s  sys 0m0.160s  user+sys 4.004

BTW, for the awkies: It seems that a[$0]=1 is faster than a[$0]++, and (!($0 in a)) is faster than (!a[$0]). So, for an awk solution I suggest:

awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2

Solution 4

With grep:

grep -F -x -v -f file_1 file_2

Here -f file_1 reads the patterns from file_1, -F treats them as fixed strings rather than regular expressions, -x matches whole lines only, and -v inverts the match, so only the lines of file_2 that match no line of file_1 are printed.
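
With the question's sample data in file_1 and file_2, this prints:

$ grep -F -x -v -f file_1 file_2
6
7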

Solution 5

How about:

diff file_1 file_2 | grep '^>' | cut -c 3-

This prints the entries in file_2 which are not in file_1. For the opposite result, replace '^>' with '^<'. cut removes the two-character prefix that diff adds, which is not part of the original content.

The files don't even need to be sorted. Bear in mind, though, that diff computes a line-by-line edit script, so if common lines appear in a different relative order in the two files, some of them can show up in the output as well.
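
For instance, with the sample files from the question:

$ diff file_1 file_2
0a1,2
> 6
> 7
5d6
< 5
$ diff file_1 file_2 | grep '^>' | cut -c 3-
6
7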

Comments

  • Admin, over 2 years ago

    Working in a Linux/shell environment, how can I accomplish the following:

    text file 1 contains:

    1
    2
    3
    4
    5
    

    text file 2 contains:

    6
    7
    1
    2
    3
    4
    

    I need to extract the entries in file 2 which are not in file 1. So '6' and '7' in this example.

    How do I do this from the command line?

    many thanks!