Difference of two big files
Solution 1
Sounds like a job for comm:
$ comm -3 <(sort test1.csv) <(sort test2.csv)
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5
As explained in man comm:
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
So, the -3 means that only lines unique to one of the files will be printed. However, those are indented according to which file they were found in. To remove the tab, use:
$ comm -3 <(sort test1.csv) <(sort test2.csv) | tr -d '\t'
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5
If your files are already sorted, you can skip the sort step and simplify the above to:
comm -3 test1.csv test2.csv | tr -d '\t' > difference.csv
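To see what the three columns look like in practice, here is a minimal sketch using two tiny, pre-sorted files; the /tmp paths and their contents are made up for illustration:

```shell
# Two tiny, pre-sorted files (hypothetical names and contents)
printf 'a\nb\nc\n' > /tmp/f1.txt
printf 'b\nc\nd\n' > /tmp/f2.txt

# Without flags, comm prints three tab-separated columns:
# column 1 = lines only in f1, column 2 = lines only in f2,
# column 3 = lines in both
comm /tmp/f1.txt /tmp/f2.txt

# Suppressing column 3 and stripping the tabs leaves only the difference
comm -3 /tmp/f1.txt /tmp/f2.txt | tr -d '\t'    # prints: a, d
```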
Solution 2
Using grep with bash process substitution:
$ cat <(grep -vFf test2.csv test1.csv) <(grep -vFf test1.csv test2.csv)
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5
To save the output as results.csv:
cat <(grep -vFf test2.csv test1.csv) <(grep -vFf test1.csv test2.csv) >results.csv
- <() is the bash process substitution pattern
- grep -vFf test2.csv test1.csv will find the lines unique to test1.csv
- grep -vFf test1.csv test2.csv will find the lines unique to test2.csv
- Finally, we concatenate the results with cat
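As a quick sketch of how those two grep invocations behave, using made-up miniature files (-f takes the patterns from a file, -F treats them as fixed strings, -v inverts the match):

```shell
# Hypothetical miniature files
printf 'aa\nbb\n' > /tmp/t1.csv
printf 'bb\ncc\n' > /tmp/t2.csv

# Lines of t1 not matched by any line of t2
grep -vFf /tmp/t2.csv /tmp/t1.csv    # prints: aa

# Lines of t2 not matched by any line of t1
grep -vFf /tmp/t1.csv /tmp/t2.csv    # prints: cc

# Caveat: -F matches substrings, so a pattern line "1,2" would also
# match "11,22"; add -x to require whole-line matches
grep -vxFf /tmp/t2.csv /tmp/t1.csv
```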
Or, as Oli suggested, you can use command grouping:
$ { grep -vFf test2.csv test1.csv; grep -vFf test1.csv test2.csv; }
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5
Or just run one after the other; since both write to standard output, their results are simply concatenated:
$ grep -vFf test2.csv test1.csv; grep -vFf test1.csv test2.csv
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5
Solution 3
If the order of rows is not relevant, use awk or perl:
awk '{seen[$0]++} END {for (i in seen) {if (seen[i] == 1) {print i}}}' 1.csv 2.csv
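The awk one-liner counts each distinct line across both files and prints the lines seen exactly once. A minimal sketch with made-up files (note that awk's for (i in seen) iterates the array in unspecified order, so the output order is arbitrary):

```shell
# Hypothetical two-line files
printf 'x\ny\n' > /tmp/a.csv
printf 'y\nz\n' > /tmp/b.csv

# seen[$0]++ counts occurrences of each line across both files;
# lines counted exactly once are unique to one file
awk '{seen[$0]++} END {for (i in seen) if (seen[i] == 1) print i}' \
    /tmp/a.csv /tmp/b.csv
# prints x and z, in arbitrary order
```

Note the caveat raised in the comments below: a line that appears twice in one file and never in the other is counted twice and therefore suppressed, unlike with comm -3.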
Use grep to get the common lines and filter those out:
grep -hxvFf <(grep -Fxf 1.csv 2.csv) 1.csv 2.csv
The inner grep gets the common lines, then the outer grep finds lines which don't match these common lines.
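A sketch of the two stages on made-up files (the <() process substitution requires bash; -x forces whole-line matches, -h hides the filename prefixes):

```shell
# Hypothetical two-line files
printf 'x\ny\n' > /tmp/c1.csv
printf 'y\nz\n' > /tmp/c2.csv

# Inner grep: whole lines (-x) of c2 that appear as fixed strings (-F) in c1
grep -Fxf /tmp/c1.csv /tmp/c2.csv          # the common lines -> y

# Outer grep: lines of either file not in that common set
grep -hxvFf <(grep -Fxf /tmp/c1.csv /tmp/c2.csv) /tmp/c1.csv /tmp/c2.csv
# -> x and z
```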
Solution 4
Use the --*-line-format=... options of diff. You can tell diff exactly what you need, as explained below:
diff --old-line-format='%L' --new-line-format='%L' --unchanged-line-format='' f1.txt f2.txt
It is possible to specify the output of diff in a very detailed way, similar to a printf format string.
The lines from the first file, test1.csv, are called "old" lines, and the lines from the second, test2.csv, are "new" lines. That makes sense when diff is used to see what changed in a file.
The options we need are the ones to set the format for "old" lines, "new" lines, and "unchanged" lines.
The formats we need are very simple:
For the changed lines, new and old, we want to output only the text of the lines. %L is the format symbol for the line text.
For the unchanged lines, we want to show nothing.
With this, we can write options like --old-line-format='%L', and put it all together, using your example data:
$ diff --old-line-format='%L' --new-line-format='%L' --unchanged-line-format='' test1.csv test2.csv
100,4,2,1,7
100,300,500,700
21,22,23,24,25
50,25,700,5
Notes on performance
Because the files have different sizes, try swapping the input files if the direction does not matter; it could be that the inner workings of diff handle one direction better than the other, needing either less memory or less computation.
There is an optimisation option for using diff with large files: --speed-large-files. It makes assumptions about the file structure, so it's not clear whether it helps in your case, but it's worth trying.
The format options are described in man diff under --LTYPE-line-format=LFMT.
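A tiny sketch of the effect on made-up two-line files (these line-format options are GNU diff extensions; also note diff exits with status 1 when the files differ, which matters in scripts using set -e):

```shell
# Hypothetical files: "a" removed, "b" unchanged, "c" added
printf 'a\nb\n' > /tmp/old.txt
printf 'b\nc\n' > /tmp/new.txt

# Old (removed) and new (added) lines print as bare text (%L);
# unchanged lines print as nothing
diff --old-line-format='%L' --new-line-format='%L' \
     --unchanged-line-format='' /tmp/old.txt /tmp/new.txt
# prints a and c; the unchanged line b is suppressed
```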
Solution 5
Since the order doesn't need to be preserved, simply:
sort test1.csv test2.csv | uniq -u
- sort test1.csv test2.csv: merges and sorts test1.csv and test2.csv
- uniq -u: prints only the lines which have no duplicate
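A minimal sketch with made-up files: after merging and sorting, lines common to both files appear twice and uniq -u keeps only the singletons.

```shell
# Hypothetical two-line files
printf '1\n2\n' > /tmp/s1.csv
printf '2\n3\n' > /tmp/s2.csv

# "2" appears in both files, so uniq -u drops it
sort /tmp/s1.csv /tmp/s2.csv | uniq -u    # prints 1 and 3
```

As with the awk approach in Solution 3, a line occurring twice in one file and never in the other would also be suppressed by uniq -u.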
Lynob
Updated on September 18, 2022

Comments
-
Lynob over 1 year
I have "test1.csv" and it contains
200,400,600,800
100,300,500,700
50,25,125,310
and test2.csv and it contains
100,4,2,1,7
200,400,600,800
21,22,23,24,25
50,25,125,310
50,25,700,5
now
diff test2.csv test1.csv > result.csv
is different than
diff test1.csv test2.csv > result.csv
I don't know which is the correct order, but I want something else; both of the commands above will output something like
> 100,4,2,1,7
2,3c3,5
< 100,300,500,700
< 50,25,125,310
\ No newline at end of file
---
> 21,22,23,24,25
> 50,25,125,310
I want to output only the difference, thus results.csv should look like this
100,300,500,700
100,4,2,1,7
21,22,23,24,25
50,25,700,5
I tried diff -q and diff -s but they didn't do the trick. Order doesn't matter; what matters is that I want to see only the difference, no > nor < nor blank space. grep -FvF did the trick on smaller files but not on big ones: the first file contains more than 5 million lines, the second contains 1300, so results.csv should have ~4,998,700 lines.
I also tried grep -F -x -v -f which didn't work.
-
Admin almost 9 years
@Tim i saw your link and i'm an old member so i know the rules but was careless, sorry :) was editing it, and i saw a popup that the post was edited so you did the work for me and I'm thankful Sir.
-
Admin almost 9 years
Should the order be preserved?
-
Admin almost 9 years
@kos the order of numbers in each row yes, the way the rows are displayed, no i don't care about that
-
Admin almost 9 years
sort of depends on what you want to do with the information, diff, IMO, is for making a patch. At any rate, I am not sure of your best tool, diff, grep, awk, or perl.
-
Admin almost 9 years
If a line appears two times in the first file but none in the second file, should the line be displayed once or twice?
-
-
Oli almost 9 years
Why cat two redirected commands? Why not just run one then the other? grep ... ; grep ... or { grep ... ; grep ... ; } if you wanted to do something with the collective output.
heemayl almost 9 years
@Oli Thanks..thats a great idea..i did not think of that..
-
kos almost 9 years
You haven't been fooled by the spaces after the 200,[...] line huh? :)
terdon almost 9 years
@kos no, I removed trailing spaces from the files first. I assumed the OP's files don't actually have them.
-
Volker Siegel almost 9 years
That does not work if one file contains a line two times, that does not appear in the other file. Both occurrences would be in a diff result.
Peter Cordes almost 9 years
Your awk command just re-implements sort | uniq -u, which gives the wrong answer when one file contains duplicate lines. For grep, I'd say "inner"/"outer", not "internal"/"external".
muru almost 9 years
@PeterCordes yes, it does and who are you to say that's the wrong result?
-
Peter Cordes almost 9 years
Wrong in the sense that it's not exactly what the question asked for, in that corner case. It might be what someone wants, but you should point out the difference between what your awk will print and what the comm -3 and diff answers will print.
muru almost 9 years
@PeterCordes again, who are you to say that? Until the OP says that's what they want, I don't care if the output differs from that of comm -3. I don't see any reason why I should explain that. If you want to edit in a note, feel free.
Peter Cordes almost 9 years
The OP said he wants the difference. That's not always what your program produces. A program that produces the same output for one testcase, but doesn't satisfy the description as written for all cases, requires a heads up. I'm here to say that, and it's true regardless of who I am or who you are. I added a note.
-
muru almost 9 years
@PeterCordes who you are is relevant, since that is a test case that the OP hasn't given. Until they give that test case, and the output shows that what you say is right, you are wrong.
-
Peter Cordes almost 9 years
The OP described in words that he wanted the DIFFERENCES between two files. You can't require every question to provide exhaustive testcases for every corner case of their clearly worded description.
-
muru almost 9 years
@PeterCordes and neither can you come up with possible, purely imaginary test cases. For all I know, neither file ever has a duplicated line, and further, OP worded it in the simplest way for the case with no duplicate lines, never thinking of this imaginary situation. WRITING IT IN ALL CAPS DOESN'T MAKE YOU RIGHTER.
-
Peter Cordes almost 9 years
Sorry, I should have used italics for emphasis. So your position is that it doesn't matter if your program makes assumptions the OP didn't say you could? As long as your output matches the one testcase provided, it doesn't matter if the behaviour doesn't match the English-language description of the desired behaviour in some corner cases? I can't agree with that position, unless there's a big gain in efficiency or code size. Even then, I'd always mention any corner case I knew my code didn't handle the way I thought the OP wanted.
-
muru almost 9 years
@PeterCordes my position is that I don't see your test case as a valid one, so whatever you say predicated on that test case is irrelevant.