How to compare two big files and get results to third file?
Solution 1
Use comm(1) to compare two sorted files and to give the differences. Use grep(1) and sort(1) to get your files into an input format suitable for comparison with comm. Use process substitution in bash to tie it together:
comm -23 <(sort file1.txt) <(grep -o '^[^;]*' file2.txt | sort)
The -23 argument to comm says to ignore lines that are common to both files (-3) and lines unique to file 2 (-2). Depending on your exact specification, you can use -1, -2 or -3.
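To make the column flags concrete, here is a small sketch that runs comm on the sample data from the question. It assumes file 2 reads www.other-domain.com, uses intermediate files instead of process substitution so it also runs in a plain POSIX shell, and writes to a scratch directory; the names file1.sorted, file2.sorted and file3.txt are only illustrative.

```shell
# Sample data from the question, written to a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' www.example.com www.domain.com www.otherexample.com \
    www.other-domain.com www.other-example.com www.exa-ample.com > "$tmp/file1.txt"
printf '%s\n' 'www.example.com;;;2' 'www.domain.com;;;5' \
    'www.other-domain.com;;;0' 'www.exa-ample.com;;;4' > "$tmp/file2.txt"

# Prepare both inputs the same way the one-liner does.
LC_ALL=C sort "$tmp/file1.txt" > "$tmp/file1.sorted"
grep -o '^[^;]*' "$tmp/file2.txt" | LC_ALL=C sort > "$tmp/file2.sorted"

# -23 keeps column 1 only: lines unique to the first input, written to a third file.
LC_ALL=C comm -23 "$tmp/file1.sorted" "$tmp/file2.sorted" > "$tmp/file3.txt"
# -13 keeps column 2 only: lines unique to the second input (none in this sample).
only_in_2=$(LC_ALL=C comm -13 "$tmp/file1.sorted" "$tmp/file2.sorted")
# -12 keeps column 3 only: lines present in both inputs.
in_both=$(LC_ALL=C comm -12 "$tmp/file1.sorted" "$tmp/file2.sorted")

cat "$tmp/file3.txt"
```

With this sample, file3.txt should contain the two domains that the question asks for, in sorted order.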
grep -o '^[^;]*' file2.txt keeps only the part of each line before the first semicolon, so the semicolon and everything after it are stripped. You can use sed(1) for this, but if you are only extracting part of a line and not adding anything else, grep will often be faster.
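As a quick check that the two approaches agree, the sketch below strips one sample line from the question with both tools. The variable names are only illustrative.

```shell
# One line in the format of file 2 from the question.
line='www.example.com;;;2'

# grep -o prints only the matched part: everything before the first semicolon.
with_grep=$(printf '%s\n' "$line" | grep -o '^[^;]*')
# sed deletes from the first semicolon to the end of the line.
with_sed=$(printf '%s\n' "$line" | sed 's/;.*//')

printf 'grep: %s\nsed:  %s\n' "$with_grep" "$with_sed"
```

One difference to keep in mind, at least with GNU grep: -o prints only non-empty matches, so a line that starts with a semicolon produces no output from grep, whereas sed prints an empty line for it.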
comm needs the input files to be sorted, so sort is used to do that. The output will be sorted. sort uses locale-specific collation, so you may need to set LC_ALL=C depending on the exact collation you want.
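The effect of the locale can be seen on a tiny input. The sketch below sorts four single-letter lines under LC_ALL=C, which orders by byte value, so uppercase letters come before lowercase ones. Other locales may interleave the cases, so their output is not shown here.

```shell
# Byte-order collation: uppercase ASCII letters sort before lowercase ones.
sorted_c=$(printf '%s\n' b A a B | LC_ALL=C sort | tr '\n' ' ')
printf '%s\n' "$sorted_c"
```

Whichever collation you choose, it is safest to run comm under the same locale setting as sort, since comm checks its input against the current locale's ordering.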
Note in your question you have www.other-domain in file 2, but www.other-domain.com in file 1. I have assumed that it is a typo in file 2 given the output.
This runs all the processes in parallel and streams the file data through them, so even if the files are large, it will not take up a lot of memory or any extra disk space to store temporary files.
Solution 2
If the input in file2 contains a subset of the contents of file1, you could just
sed 's/;.*//' file2 | grep -Fvxf - file1 >not-in-file2
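The flags carry most of the meaning here, so the sketch below spells them out one by one on the sample data from the question (assuming file 2 reads www.other-domain.com). It uses grep -F, which is equivalent to fgrep; the scratch directory is only illustrative.

```shell
# Sample data from the question, written to a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' www.example.com www.domain.com www.otherexample.com \
    www.other-domain.com www.other-example.com www.exa-ample.com > "$tmp/file1"
printf '%s\n' 'www.example.com;;;2' 'www.domain.com;;;5' \
    'www.other-domain.com;;;0' 'www.exa-ample.com;;;4' > "$tmp/file2"

# -F  treat each pattern as a fixed string, not a regular expression
# -v  print the lines that do NOT match
# -x  a pattern must match the whole line, not just part of it
# -f - read the list of patterns from standard input
sed 's/;.*//' "$tmp/file2" | grep -F -v -x -f - "$tmp/file1" > "$tmp/not-in-file2"

cat "$tmp/not-in-file2"
```

Unlike the comm approach, this keeps the lines in the order they have in file1 and needs no sorting. The trade-off is that grep has to hold the whole pattern list in memory, which may matter for a file2 of several hundred megabytes.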
The same general idea can be applied to diff or comm. However, comm requires sorted input, but if that is not a problem (or if your data can be sorted to start with), just preprocess the data from file2.
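sed 's/;.*//' file2.sorted | comm -13 - file1.sorted >cmp.out
Here -13 suppresses the lines unique to the preprocessed file2 data and the lines common to both inputs, leaving only the lines that appear in file1.sorted alone.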
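One caveat when piping an already sorted file2.sorted through sed: removing the suffix can change the relative order of the lines when one key is a prefix of another. The sketch below uses two hypothetical entries (not from the question) to show this, and uses sort -c to check the order.

```shell
tmp=$(mktemp -d)
# Hypothetical entries chosen so that one key is a prefix of the other.
printf '%s\n' 'www.a;;;1' 'www.a-b;;;2' | LC_ALL=C sort > "$tmp/file2.sorted"

# Strip the suffix, then let sort -c verify the order (it exits non-zero on disorder).
if sed 's/;.*//' "$tmp/file2.sorted" | LC_ALL=C sort -c 2>/dev/null; then
    still_sorted=yes
else
    still_sorted=no
fi
echo "still sorted after stripping: $still_sorted"

# Stripping first and sorting afterwards sidesteps the problem.
sed 's/;.*//' "$tmp/file2.sorted" | LC_ALL=C sort > "$tmp/file2.keys"
```

If your keys can never be prefixes of one another, the shorter pipeline is fine; otherwise sorting after the sed step is the safer order of operations.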
The constraint that input needs to be sorted is what allows comm to handle really large files, because it just needs to keep the latest data in memory at any one time. You could do the same with your own custom awk script.
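As a rough illustration of what such an awk script might look like, here is a minimal sketch of a streaming merge over two sorted files. It is one possible approach, not a drop-in replacement for comm: it assumes both inputs were sorted under LC_ALL=C, that the lines do not look like numbers (awk would then compare them numerically), and the file names a.sorted and b.sorted are only illustrative.

```shell
tmp=$(mktemp -d)
# Two small inputs, already in byte order (LC_ALL=C).
printf '%s\n' www.domain.com www.exa-ample.com www.example.com \
    www.other-domain.com www.other-example.com www.otherexample.com > "$tmp/a.sorted"
printf '%s\n' www.domain.com www.exa-ample.com www.example.com \
    www.other-domain.com > "$tmp/b.sorted"

# Print the lines of a.sorted that do not occur in b.sorted,
# holding only one line of each file in memory at a time.
result=$(LC_ALL=C awk -v other="$tmp/b.sorted" '
BEGIN { have = ((getline key < other) > 0) }   # read the first key of the second file
{
    # Advance the second file until its current key is not smaller than this line.
    while (have && key < $0)
        have = ((getline key < other) > 0)
    # Print the line unless the second file has an equal key.
    if (!have || key != $0)
        print
}' "$tmp/a.sorted")

printf '%s\n' "$result"
```

Because each file is read strictly front to back, memory use stays constant regardless of file size, which is the same property that makes comm suitable for large inputs.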
Admin
Updated on June 05, 2022
I have two files
1st file is like this:
www.example.com
www.domain.com
www.otherexample.com
www.other-domain.com
www.other-example.com
www.exa-ample.com
2nd file is like this (numbers after ;;; are between 0-10):
www.example.com;;;2
www.domain.com;;;5
www.other-domain;;;0
www.exa-ample.com;;;4
I want to compare these two files and write the result to a third file, like this:
www.otherexample.com
www.other-example.com
Both files are large (over 500 MB).