Using Diff on a specific column in a file

Solution 1

awk is a better tool for comparing columns of files. See, for example, the answer to "compare two columns of different files and print if it matches"; there are similar answers out there for printing lines with matching columns.

Since you want to print lines that don't match, we can create an awk command that prints the lines in file2 for which column 2 has not been seen in file1:

$ awk 'NR==FNR{c[$2]++;next};c[$2] == 0' file1 file2
Another   193 stuff2
Another   783 stuff3

As explained similarly by terdon in the above-mentioned question,

  • NR==FNR : NR is the current input line number and FNR the current file's line number. The two will be equal only while the 1st file is being read.
  • c[$2]++; next : if this is the 1st file, count the 2nd field in the c array, so that c[$2] is nonzero for every value seen there. Then skip to the next line, so that this block is only applied to the 1st file.
  • c[$2] == 0 : this pattern is only reached while the second file is being read, so we check whether field 2 of the current line was never seen in the 1st file (c[$2]==0) and, if it was not, we print the line. In awk, the default action is to print the line, so if c[$2]==0 is true, the line will be printed.
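
To see the NR==FNR test at work, here is a minimal sketch that prints both counters for every input line. It recreates the question's two sample files in a scratch directory; the function name nr_fnr_demo and the use of mktemp are assumptions made for illustration, not part of the original answer.

```shell
# Recreate the question's sample files in a scratch directory (illustration only).
dir=$(mktemp -d)
printf '%s\n' 'Something  123 item1' 'Something  456 item2' \
              'Something  768 item3' 'Something  353 item4' > "$dir/file1"
printf '%s\n' 'Another   123 stuff1' 'Another   193 stuff2' \
              'Another   783 stuff3' 'Another   353 stuff4' > "$dir/file2"

# nr_fnr_demo (hypothetical helper): print the file name, NR and FNR per line.
# NR keeps counting across files, FNR restarts at 1 for each new file.
nr_fnr_demo()
{
   awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' "$1" "$2"
}

(cd "$dir" && nr_fnr_demo file1 file2)
```

With these four-line files, the first line of file2 is reported as NR=5 FNR=1, which is why NR==FNR is true only while the 1st file is being read.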

But you also want the lines from file1 for which column 2 doesn't match in file2. You can get these by simply swapping the two file arguments in the same command:

$ awk 'NR==FNR{c[$2]++;next};c[$2] == 0' file2 file1
Something  456 item2
Something  768 item3

So now you can generate the output you want, by using awk twice. Perhaps someone with more awk expertise can get it done in one pass.

You tagged your question with ksh, so I'll assume you are using the Korn shell. In ksh you can define a function for your diff, say diffcol2, to make your job easier:

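diffcol2()
{
   # lines of $1 whose column 2 does not appear in $2
   awk 'NR==FNR{c[$2]++;next};c[$2] == 0' "$2" "$1"
   # lines of $2 whose column 2 does not appear in $1
   awk 'NR==FNR{c[$2]++;next};c[$2] == 0' "$1" "$2"
}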

This has the behavior you desire:

$ diffcol2 file1 file2
Something  456 item2
Something  768 item3
Another   193 stuff2
Another   783 stuff3
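
For what it's worth, the same output can probably be had from a single awk invocation by remembering both files and doing the comparison in an END block. The following is only a sketch under that assumption: the function name diffcol2_once and the array names are made up here, and both files are held in memory, so it suits small inputs.

```shell
# diffcol2_once is a hypothetical name; column 2 is the key, as in the question.
diffcol2_once()
{
   awk '
      # 1st file: remember each line and its key, in input order
      NR == FNR { line1[++n1] = $0; key1[n1] = $2; seen1[$2] = 1; next }
      # 2nd file: same bookkeeping, in separate arrays
      { line2[++n2] = $0; key2[n2] = $2; seen2[$2] = 1 }
      END {
         # lines of the 1st file whose key never occurs in the 2nd
         for (i = 1; i <= n1; i++) if (!(key1[i] in seen2)) print line1[i]
         # lines of the 2nd file whose key never occurs in the 1st
         for (i = 1; i <= n2; i++) if (!(key2[i] in seen1)) print line2[i]
      }
   ' "$1" "$2"
}
```

Called as diffcol2_once file1 file2, it prints the unmatched lines of file1 first and then those of file2, each group in its original order. Because every input is read exactly once, this variant should also cope with inputs that cannot be reopened, such as pipes.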

Solution 2

I don't think diff (even in combination with cut) will be flexible enough to handle this. And it seems as though what you really want is keys in file1 that are not in file2 and vice versa - not strictly a line-by-line diff. If the input files are big, I would go with perl, but for small files this awk script works for the input provided:

%cat a.awk

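BEGIN {
  # Read each file by hand; "> 0" ends the loop on EOF (0) and on a read error (-1)
  while ((getline line < "file1") > 0) {
    split(line,f," ");
    key=f[2];          # key on the 2nd column
    f1[key]=line
  }
  while ((getline line < "file2") > 0) {
    split(line,f," ");
    key=f[2];
    f2[key]=line
  }
}
END {
  # Print the lines whose key appears in only one of the two files
  for (c in f1) {
    if (!(c in f2)) print f1[c]
  }
  for (c in f2) {
    if (!(c in f1)) print f2[c]
  }
}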

And this is how you run it (note the trick with /dev/null: because the script has an END block, awk reads its input before running it, and would otherwise wait on standard input):

%awk -f a.awk /dev/null
Something  456 item2
Something  768 item3
Another   193 stuff2
Another   783 stuff3
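
As a variation on this approach, the file names can be passed in with -v instead of being hard-coded, and the comparison can live in the BEGIN block, in which case awk reads no input at all and the /dev/null argument is not needed. This is a rough sketch only: the function name keydiff and the variable names f1 and f2 are invented for illustration.

```shell
# keydiff is a hypothetical wrapper; f1 and f2 are awk variables set via -v.
keydiff()
{
   awk -v f1="$1" -v f2="$2" '
      BEGIN {
         # load each file, keyed on the 2nd column
         while ((getline line < f1) > 0) { split(line, f, " "); a[f[2]] = line }
         while ((getline line < f2) > 0) { split(line, f, " "); b[f[2]] = line }
         # print lines whose key is present in only one of the files
         for (k in a) if (!(k in b)) print a[k]
         for (k in b) if (!(k in a)) print b[k]
      }'
}
```

Run it as keydiff file1 file2. Two caveats: the order in which for (k in a) visits keys is unspecified in awk, so the lines within each group may come out in a different order, and -v interprets backslash escapes, so file names containing backslashes would need extra care.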

Updated on September 18, 2022

Comments

  • Admin
    Admin over 1 year

    Is it possible to use diff on a specific column in a file?

    file1

    Something  123 item1
    Something  456 item2
    Something  768 item3
    Something  353 item4
    

    file2

    Another   123 stuff1
    Another   193 stuff2
    Another   783 stuff3
    Another   353 stuff4
    

    Expected output

    Something  456 item2
    Something  768 item3
    Another   193 stuff2
    Another   783 stuff3
    

    I want to diff the 2nd column of each file; the result should contain the lines whose 2nd column differs, printed as whole lines.

  • Ian McGowan
    Ian McGowan over 9 years
    Nice! The NR==FNR trick is nifty, and I like the way it gets wrapped up into a two line function. Magical, but a great explanation of all the complicated bits - you have my vote!
  • baptx
    baptx over 3 years
    I had to use awk -F '\t' parameter to make it work since I was using CSV files with tab separator. Otherwise a space was considered as a separator for columns. unix.stackexchange.com/questions/134829/…
  • Steven Lu
    Steven Lu about 3 years
    This is really neat. However I have an unusual criticism for this code. Since this makes two calls to awk to get the desired output, if we send in redirected streams rather than actual files, it will only perform the first half of the work. I'm curious if (as an awk expert) you could come up with a way to get this done with only one call to awk? :D