Using Diff on a specific column in a file
Solution 1
awk is a better tool for comparing columns of files. See, for example, the answer to "compare two columns of different files and print if it matches"; there are similar answers out there for printing lines with matching columns.
Since you want to print lines that don't match, we can create an awk command that prints the lines in file2 for which column 2 has not been seen in file1:
$ awk 'NR==FNR{c[$2]++;next};c[$2] == 0' file1 file2
Another 193 stuff2
Another 783 stuff3
As explained similarly by terdon in the above-mentioned question:
- NR==FNR: NR is the current input line number and FNR the current file's line number. The two will be equal only while the 1st file is being read.
- c[$2]++; next: if this is the 1st file, count the 2nd field in the c array. Then skip to the next line so that this is only applied to the 1st file.
- c[$2] == 0: this condition is only reached for the second file, so we check whether field 2 of this line was never seen in the 1st file (c[$2]==0) and, if it was not, we print the line. In awk, the default action is to print the line, so if c[$2]==0 is true, the line will be printed.
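To see the NR==FNR behavior for yourself, here is a small sketch that prints both counters next to the file name. The scratch directory and the shortened two-line sample files are assumptions made for illustration only, and mktemp -d is assumed to be available on your system.

```shell
# Work in a throwaway directory so no real files are touched (assumes mktemp -d exists)
tmp=$(mktemp -d)
cd "$tmp"

# Shortened sample data in the same three-column layout as the question
printf '%s\n' 'Something 123 item1' 'Something 456 item2' > file1
printf '%s\n' 'Another 123 stuff1' 'Another 193 stuff2' > file2

# Print the current file name, the overall line number (NR)
# and the per-file line number (FNR) for every input line
awk '{ print FILENAME, NR, FNR }' file1 file2
# file1 1 1
# file1 2 2
# file2 3 1
# file2 4 2
```

NR and FNR agree only on the file1 lines. One caveat: if the 1st file is empty, NR==FNR is also true while the 2nd file is read, so the idiom misbehaves in that case.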
But you also want the lines from file1 for which column 2 doesn't match in file2. This you can get by simply exchanging their position in the same command:
$ awk 'NR==FNR{c[$2]++;next};c[$2] == 0' file2 file1
Something 456 item2
Something 768 item3
So now you can generate the output you want by using awk twice. Perhaps someone with more awk expertise can get it done in one pass.
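For what it's worth, a one-pass version might look like the sketch below. It is not a definitive solution: it assumes both inputs fit in memory, it tells the files apart by comparing FILENAME with ARGV[1] instead of using NR==FNR, and the function name diffcol2_onepass and all array names are made up for this example. The scratch directory and mktemp -d are likewise assumptions.

```shell
# Recreate the sample data from the question in a throwaway directory
tmp=$(mktemp -d)
cd "$tmp"
printf '%s\n' 'Something 123 item1' 'Something 456 item2' 'Something 768 item3' 'Something 353 item4' > file1
printf '%s\n' 'Another 123 stuff1' 'Another 193 stuff2' 'Another 783 stuff3' 'Another 353 stuff4' > file2

# Hypothetical helper: one awk call that reads each file exactly once
diffcol2_onepass()
{
    awk '
    # Lines of the first file argument: remember key, whole line and input order
    FILENAME == ARGV[1] { key1[++n1] = $2; line1[n1] = $0; seen1[$2] = 1; next }
    # Everything else belongs to the second file argument
                        { key2[++n2] = $2; line2[n2] = $0; seen2[$2] = 1 }
    END {
        # First-file lines whose key never occurred in the second file
        for (i = 1; i <= n1; i++) if (!(key1[i] in seen2)) print line1[i]
        # Second-file lines whose key never occurred in the first file
        for (i = 1; i <= n2; i++) if (!(key2[i] in seen1)) print line2[i]
    }' "$1" "$2"
}

diffcol2_onepass file1 file2
```

Because each input is opened only once, this should also cope with inputs that cannot be re-read, at the cost of holding both files in memory until the END block runs.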
You tagged your question with /ksh, so I'll assume you are using Korn shell. In ksh you can define a function for your diff, say diffcol2, to make your job easier:
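diffcol2()
{
    # Lines of the 1st file whose column 2 does not occur in the 2nd file
    awk 'NR==FNR{c[$2]++;next};c[$2] == 0' "$2" "$1"
    # Lines of the 2nd file whose column 2 does not occur in the 1st file
    awk 'NR==FNR{c[$2]++;next};c[$2] == 0' "$1" "$2"
}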
This has the behavior you desire:
$ diffcol2 file1 file2
Something 456 item2
Something 768 item3
Another 193 stuff2
Another 783 stuff3
Solution 2
I don't think diff (even in combination with cut) will be flexible enough to handle this. And it seems as though what you really want is keys in file1 that are not in file2 and vice versa - not strictly a line-by-line diff. If the input files are big, I would go with perl, but for small files this awk script works for the input provided:
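% cat a.awk
BEGIN {
    # Index every line of file1 by its 2nd field; "> 0" stops on EOF and on read errors
    while ((getline line < "file1") > 0) {
        split(line, f, " ")
        f1[f[2]] = line
    }
    # Index every line of file2 the same way
    while ((getline line < "file2") > 0) {
        split(line, f, " ")
        f2[f[2]] = line
    }
}
END {
    # Lines whose key appears only in file1
    for (c in f1) {
        if (!(c in f2)) print f1[c]
    }
    # Lines whose key appears only in file2
    for (c in f2) {
        if (!(c in f1)) print f2[c]
    }
}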
And this is how you run it (note the trick with /dev/null: because the script has an END block, awk would otherwise wait for input on standard input before running it):
% awk -f a.awk /dev/null
Something 456 item2
Something 768 item3
Another 193 stuff2
Another 783 stuff3
Admin
Updated on September 18, 2022
Comments
-
Admin over 1 year
Is it possible to use diff on a specific column in a file?
file1
Something 123 item1
Something 456 item2
Something 768 item3
Something 353 item4
file2
Another 123 stuff1
Another 193 stuff2
Another 783 stuff3
Another 353 stuff4
output(Expected)
Something 456 item2
Something 768 item3
Another 193 stuff2
Another 783 stuff3
I want to diff the 2nd column of each file, and the result should contain the differing column values along with their whole lines.
-
Ian McGowan over 9 years
Nice! The NR==FNR trick is nifty, and I like the way it gets wrapped up into a two line function. Magical, but a great explanation of all the complicated bits - you have my vote!
-
baptx over 3 years
I had to use the awk -F '\t' parameter to make it work since I was using CSV files with a tab separator. Otherwise a space was treated as a column separator. unix.stackexchange.com/questions/134829/…
-
Steven Lu about 3 years
This is really neat. However I have an unusual criticism for this code. Since this makes two calls to awk to get the desired output, if we send in redirected streams rather than actual files, it will only perform the first half of the work. I'm curious if (as an awk expert) you could come up with a way to get this done with only one call to awk? :D