diff stop after first difference


Solution 1

cmp stops at the first difference:

% cat foo
foo
bar
baz
---
foo
bar
baz
% cat bar
foo
bar
baz
---
foo+
bar+
baz+
% cmp foo bar
foo bar differ: byte 20, line 5
% 

You could wrap a script around it in order to print the different lines:

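#! /bin/bash
# cmp reports the first difference; the last field of its output is the line number.
line=$(cmp "$1" "$2" | awk '{print $NF}')
# $line stays empty when the files match (or when cmp only reports EOF on stderr).
if [ -n "$line" ]; then
    # Print that line from each file, then stop reading.
    awk -v file="$1" -v line="$line" 'NR==line{print "In file "file": "$0; exit}' "$1"
    awk -v file="$2" -v line="$line" 'NR==line{print "In file "file": "$0; exit}' "$2"
fi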
% ./script.sh foo bar
In file foo: foo
In file bar: foo+

Part of the work now shifts to the two awk commands, each of which rereads its file up to the differing line, but this should still be significantly faster than comparing both files in full.
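One case the wrapper above does not report: when one file is a prefix of the other, cmp writes an EOF notice to standard error instead of a "differ" line, so the captured line number is empty and nothing is printed. A minimal sketch of a variant that checks cmp's exit status and passes the EOF notice through; the function name first_diff is hypothetical, and matching on the text "cmp: EOF on" assumes cmp is invoked as cmp with English messages (hence LC_ALL=C):

```shell
#!/bin/bash
# first_diff FILE1 FILE2  (hypothetical helper name, not from the original answer)
# Returns 0 if the files match, 1 if they differ, 2 on any other cmp failure.
first_diff() {
    local out status line
    # Capture stderr too: the EOF notice is written there, not to stdout.
    out=$(LC_ALL=C cmp "$1" "$2" 2>&1)
    status=$?
    if [ "$status" -eq 0 ]; then
        return 0
    elif [ "$status" -ne 1 ]; then
        printf '%s\n' "$out" >&2
        return 2
    fi
    case $out in
        "cmp: EOF on "*)
            # Assumed message prefix: one file is a prefix of the other,
            # so there is no differing line to show.
            printf '%s\n' "$out"
            return 1
            ;;
    esac
    line=${out##* }    # last word of "... differ: byte N, line M"
    awk -v file="$1" -v line="$line" 'NR==line{print "In file "file": "$0; exit}' "$1"
    awk -v file="$2" -v line="$line" 'NR==line{print "In file "file": "$0; exit}' "$2"
    return 1
}

# Run directly only when two file names are given.
if [ "$#" -eq 2 ]; then
    first_diff "$1" "$2"
fi
```

Saved as a script, it would be called the same way as before, for example ./script.sh foo bar.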

Solution 2

I have tested this only with trivial cases; testing on real data is left to you:

$ cat f1
l1
l21 l22       l23  l24


l3
l4x
l5


$ cat f2
l1
l21 l22       l23

l3
l4y
l5



$ cat awkdiff.awk


BEGIN {
   maxdiff = 5
   ignoreemptylines = 1
   whitespaceaware = 1

   if (whitespaceaware) {
      emptypattern = "^[[:space:]]*$"
   } else {
      emptypattern = "^$"
      FS=""
   }

   f1 = ARGV[1]
   f2 = ARGV[2]

   rc1=rc2=1
   while( (rc1>0 && rc2>0 && diff<maxdiff)  ) {
      rc1 = getline l1 < f1 ; ++nr1
      rc2 = getline l2 < f2 ; ++nr2

      if (ignoreemptylines) {
         while ( l1 ~ emptypattern   &&  rc1>0) {
            rc1 = getline l1 < f1 ; ++nr1
         }

         while ( l2 ~ emptypattern  &&  rc2>0) {
            rc2 = getline l2 < f2 ; ++nr2
         }
      }


      if ( rc1>0 && rc2>0) {
         nf1 = split( l1, a1)
         nf2 = split( l2, a2)

         if ( nf1 <= nf2) {
            nfmin = nf1
         } else {
            nfmin = nf2
         }

         founddiff = 0
         for (i=1; i<=nfmin; ++i) {
            if ( a2[i]"" != a1[i]"") {
               # Report line number, index of the differing field, and its value.
               printf "%d:%d:{%s} != %d:%d:{%s}\n", \
                  nr1, i, a1[i], nr2, i, a2[i]
               founddiff=1
               ++diff
               break
            }
         }

         if ( !founddiff  &&  nf1 != nf2) {
            if ( nf1 > nf2)
               printf "%d:%d:{%s} != %d:EOL\n", nr1, nfmin+1, a1[nfmin+1], nr2
            else
               printf "%d:EOL != %d:%d:{%s}\n", nr1, nr2, nfmin+1, a2[nfmin+1]
            ++diff
         }
      } else {
         if ( rc1 == -1 && rc2 == -1) {
            print "IO error"
         } else if ( rc1 == 1 && rc2 == 0) {
            printf "%d:%s != EOL\n", nr1, l1
         } else if ( rc1 == 0 && rc2 == 1) {
            printf "EOL != %d:%s\n", nr2, l2
         }
      }
   }
}


$ awk -f awkdiff.awk  /tmp/f1 /tmp/f2
2:4:{l24} != 2:EOL
6:1:{l4x} != 5:1:{l4y}

maxdiff = N: the number of differences after which the comparison stops

ignoreemptylines = 1|0: whether empty lines are skipped when comparing

whitespaceaware = 1|0: whether lines are compared word by word (treating any run of whitespace as a single separator) or as whole lines
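For reference, the core of this approach (reading both files in lockstep with getline and stopping after a given number of mismatches) can be sketched more compactly. This is a line-wise simplification, not a replacement for awkdiff.awk: it has no empty-line or whitespace handling, the function name lockstep_diff and its optional third argument are hypothetical, and a read error on the second file is treated the same as end of file.

```shell
#!/bin/bash
# lockstep_diff FILE1 FILE2 [MAX]  (hypothetical helper, not part of awkdiff.awk)
# Prints up to MAX (default 1) differing lines, then stops reading.
lockstep_diff() {
    awk -v f2="$2" -v max="${3:-1}" '
        {
            # Read the matching line of the second file (EOF or error ends the run).
            if ((getline other < f2) <= 0) {
                printf "%d:{%s} != EOF\n", NR, $0
                done = 1
                exit
            }
            # Appending "" forces a string comparison, as in the script above.
            if (($0 "") != (other "")) {
                printf "%d:{%s} != %d:{%s}\n", NR, $0, NR, other
                if (++n >= max) { done = 1; exit }
            }
        }
        END {
            # First file ended; report if the second still has lines left.
            if (!done && (getline other < f2) > 0)
                printf "EOF != %d:{%s}\n", NR + 1, other
        }
    ' "$1"
}

# Run directly only when at least two file names are given.
if [ "$#" -ge 2 ]; then
    lockstep_diff "$@"
fi
```

Because awk holds only the current line of each file, memory use stays flat regardless of file size, and the exit statements stop reading as soon as the limit is reached.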


Author: TTT

Updated on September 18, 2022

Comments

  • TTT, over 1 year
    I'd like to perform a diff on 2 files and have it cease at the first difference. I don't require that the command be done via diff, of course, but I do require that the actual command cease once the first difference is found and reported. I'm running on some very large files, and expect a perfect match, but still want to know what the difference was, should one be found, so diff -q, diff ... |head -1, and cmp are no good. And, since the files are very large, something that doesn't exhaust memory would be nice. Although not necessary for my current problem, bonus points for solutions that work for the first (user-specified) n differences, and for ones that can ignore whitespace differences.

  • kos, over 8 years
    @TTT Not sure what you mean, the script I proposed shows the first different line in both files (in the example that is line 5).
  • TTT, over 8 years
    Whoops, meant to delete the comment, accidental post. Deleting now.
  • Vladimir Panteleev, almost 8 years
    Warning! If one file is a prefix of the other, cmp will simply print cmp: EOF on shorter-file. If this can happen with your input, make sure to handle this edge case.