Using Diff on a specific column in a file

shell-script scripting ksh diff file-comparison

24,264

Solution 1

awk is a better tool for comparing columns of files. See, for example, the answer to: compare two columns of different files and print if it matches -- there are similar answers out there for printing lines for matching columns.

Since you want to print lines that don't match, we can create an awk command that prints the lines in file2 for which column 2 has not been seen in file1:

$ awk 'NR==FNR{c[$2]++;next};c[$2] == 0' file1 file2
Another   193 stuff2
Another   783 stuff3

As explained similarly by terdon in the above-mentioned question,

NR==FNR : NR is the current input line number and FNR the current file's line number. The two will be equal only while the 1st file is being read.
c[$2]++; next : if this is the 1st file, count the value of the 2nd field as a key in the c array. Then skip to the next line, so that the rest of the program is never applied to the 1st file.
c[$2] == 0 : because of the next above, this pattern is only reached for lines of the second file. It tests whether field 2 of the line was never seen in the 1st file (c[$2]==0), and if it was not seen, the line is printed. In awk, a pattern without an action prints the line by default, so whenever c[$2]==0 is true the line is printed.
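
To make the three pieces above easier to follow, here is the same one-liner spread over several lines with comments. This is only a sketch: it recreates the sample file1 and file2 from the question in a scratch directory, and the variable name only_in_file2 is my own choice.

```shell
# Sample data copied from the question, written to a scratch directory.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' 'Something  123 item1' 'Something  456 item2' \
              'Something  768 item3' 'Something  353 item4' > file1
printf '%s\n' 'Another   123 stuff1' 'Another   193 stuff2' \
              'Another   783 stuff3' 'Another   353 stuff4' > file2

# Same logic as the one-liner, spelled out.
only_in_file2=$(awk '
    NR == FNR {      # true only while the first file (file1) is being read
        c[$2]++      # count each column-2 value seen in file1
        next         # skip the rule below for file1 lines
    }
    c[$2] == 0       # file2 line whose column 2 never occurred in file1: print it
' file1 file2)

printf '%s\n' "$only_in_file2"
```

With the sample data this prints the two "Another" lines whose keys are 193 and 783.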

But you also want the lines from file1 for which column 2 doesn't match in file2. You can get these by swapping the two file arguments in the same command:

$ awk 'NR==FNR{c[$2]++;next};c[$2] == 0' file2 file1
Something  456 item2
Something  768 item3

So now you can generate the output you want, by using awk twice. Perhaps someone with more awk expertise can get it done in one pass.
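
For what it's worth, one possible single-invocation version is sketched below. It is not the method used above: it keeps every line of both files in memory and does the comparison in an END block, so each input is read only once. The array names (line, key, seen, n) and the counter f are my own, and the FNR == 1 test assumes neither input is empty.

```shell
# Sample data copied from the question, written to a scratch directory.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' 'Something  123 item1' 'Something  456 item2' \
              'Something  768 item3' 'Something  353 item4' > file1
printf '%s\n' 'Another   123 stuff1' 'Another   193 stuff2' \
              'Another   783 stuff3' 'Another   353 stuff4' > file2

diff_col2=$(awk '
    FNR == 1 { f++ }                 # a new input file starts (assumes no empty files)
    {
        n[f]++
        line[f, n[f]] = $0           # keep every line in input order
        key[f, n[f]]  = $2           # remember its column-2 value
        seen[f, $2]   = 1            # column-2 values present in file number f
    }
    END {
        for (i = 1; i <= n[1]; i++)  # file1 lines whose key is absent from file2
            if (!((2, key[1, i]) in seen)) print line[1, i]
        for (i = 1; i <= n[2]; i++)  # file2 lines whose key is absent from file1
            if (!((1, key[2, i]) in seen)) print line[2, i]
    }
' file1 file2)

printf '%s\n' "$diff_col2"
```

The trade-off is memory: both files are held in arrays, whereas the two-call version only stores the keys of one file at a time.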

You tagged your question with ksh, so I'll assume you are using the Korn shell. In ksh you can define a function for your diff, say diffcol2, to make your job easier:

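diffcol2()
{
   # Lines of file1 ($1) whose column 2 does not occur in file2 ($2), then the reverse.
   # The shell arguments are quoted so that file names containing spaces still work.
   awk 'NR==FNR{c[$2]++;next};c[$2] == 0' "$2" "$1"
   awk 'NR==FNR{c[$2]++;next};c[$2] == 0' "$1" "$2"
}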

This has the behavior you desire:

$ diffcol2 file1 file2
Something  456 item2
Something  768 item3
Another   193 stuff2
Another   783 stuff3
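
If you need the same comparison on a column other than the 2nd, the function can take the column number as a parameter. The following is a sketch under my own assumptions: the name diffcol and its optional third argument are hypothetical, and the column defaults to 2 when omitted. It uses only POSIX function syntax, so it should behave the same in ksh and other sh-compatible shells.

```shell
# Sample data copied from the question, written to a scratch directory.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' 'Something  123 item1' 'Something  456 item2' \
              'Something  768 item3' 'Something  353 item4' > file1
printf '%s\n' 'Another   123 stuff1' 'Another   193 stuff2' \
              'Another   783 stuff3' 'Another   353 stuff4' > file2

# diffcol FILE1 FILE2 [COLUMN]  (hypothetical helper; COLUMN defaults to 2)
diffcol()
{
    col=${3:-2}
    # Inside awk, $col is the field whose number is stored in the awk variable col.
    awk -v col="$col" 'NR==FNR{c[$col]++;next};c[$col] == 0' "$2" "$1"
    awk -v col="$col" 'NR==FNR{c[$col]++;next};c[$col] == 0' "$1" "$2"
}

by_col2=$(diffcol file1 file2)      # same result as diffcol2
by_col1=$(diffcol file1 file2 1)    # column 1 never matches, so all lines print
printf '%s\n' "$by_col2"
```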

Solution 2

I don't think diff (even in combination with cut) will be flexible enough to handle this. And it seems as though what you really want is keys in file1 that are not in file2 and vice versa - not strictly a line-by-line diff. If the input files are big, I would go with perl, but for small files this awk script works for the input provided:

%cat a.awk

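BEGIN {
  # Index each file's lines by their 2nd field.
  # Comparing getline's result with 0 ends the loop on EOF and on read errors.
  while ((getline line < "file1") > 0) {
    split(line,f," ");
    key=f[2];
    f1[key]=line
  }
  while ((getline line < "file2") > 0) {
    split(line,f," ");
    key=f[2];
    f2[key]=line
  }
}
END {
  # Print lines whose key occurs in only one file (for-in order is unspecified).
  for (c in f1) {
    if (!(c in f2)) print f1[c]
  }
  for (c in f2) {
    if (!(c in f1)) print f2[c]
  }
}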

And this is how you run it (note the trick with /dev/null, since awk expects an input file as a parameter):

%awk -f a.awk /dev/null
Something  456 item2
Something  768 item3
Another   193 stuff2
Another   783 stuff3
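
A variation on the same idea, as a sketch only: the file names can be passed in with -v instead of being hard-coded, and the comparison can live in the BEGIN block as well. The variable names a and b are my own. With no rules other than BEGIN, gawk and mawk exit without reading any main input, so the /dev/null argument is not needed there; I have not checked other awk implementations. The order of lines within each group is whatever for-in yields, which awk does not specify.

```shell
# Sample data copied from the question, written to a scratch directory.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' 'Something  123 item1' 'Something  456 item2' \
              'Something  768 item3' 'Something  353 item4' > file1
printf '%s\n' 'Another   123 stuff1' 'Another   193 stuff2' \
              'Another   783 stuff3' 'Another   353 stuff4' > file2

keys_diff=$(awk -v a=file1 -v b=file2 '
BEGIN {
    # Index every line of each file by its 2nd field.
    while ((getline line < a) > 0) { split(line, f, " "); f1[f[2]] = line }
    while ((getline line < b) > 0) { split(line, f, " "); f2[f[2]] = line }
    # Keys present in only one of the two files.
    for (k in f1) if (!(k in f2)) print f1[k]
    for (k in f2) if (!(k in f1)) print f2[k]
}')

printf '%s\n' "$keys_diff"
```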


Admin

Updated on September 18, 2022

Comments

Admin over 1 year

Is it possible to use diff on a specific column of a file?

file1

Something  123 item1
Something  456 item2
Something  768 item3
Something  353 item4

file2

Another   123 stuff1
Another   193 stuff2
Another   783 stuff3
Another   353 stuff4

output(Expected)

Something  456 item2
Something  768 item3
Another   193 stuff2
Another   783 stuff3

I want to diff the 2nd column of the two files, and the result should show the whole line for each value of that column that differs.

Ian McGowan over 9 years

Nice! The NR==FNR trick is nifty, and I like the way it gets wrapped up into a two line function. Magical, but a great explanation of all the complicated bits - you have my vote!

baptx over 3 years

I had to pass -F '\t' to awk to make it work, since my CSV files use a tab as the separator. Otherwise a space was also treated as a column separator. unix.stackexchange.com/questions/134829/…

Steven Lu about 3 years

This is really neat. However I have an unusual criticism for this code. Since this makes two calls to awk to get the desired output, if we send in redirected streams rather than actual files, it will only perform the first half of the work. I'm curious if (as an awk expert) you could come up with a way to get this done with only one call to awk? :D