How to compare two big files and write the results to a third file?

Solution 1

Use comm(1) to compare two sorted files and to give the differences. Use grep(1) and sort(1) to get your files into an input format suitable for comparison with comm. Use process substitution in bash to tie it together:

comm -23 <(sort file1.txt) <(grep -o '^[^;]*' file2.txt | sort)

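The -23 argument tells comm to suppress lines that are common to both files (-3) and lines unique to file 2 (-2), leaving only the lines unique to file 1. Depending on your exact specification, you can combine -1, -2 and -3 differently: each flag suppresses one of the three output columns (unique to file 1, unique to file 2, common to both).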
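To make the column flags concrete, here is a minimal sketch on two tiny, already sorted files. The file names left and right and their contents are made up for illustration.

```shell
# Hypothetical sample data: two small files that are already sorted.
tmp=$(mktemp -d)
printf '%s\n' a b > "$tmp/left"
printf '%s\n' b c > "$tmp/right"

# Each flag suppresses one column: -1 unique to file 1,
# -2 unique to file 2, -3 common to both.
only_left=$(comm -23 "$tmp/left" "$tmp/right")
only_right=$(comm -13 "$tmp/left" "$tmp/right")
in_both=$(comm -12 "$tmp/left" "$tmp/right")

printf 'only in left:  %s\n' "$only_left"
printf 'only in right: %s\n' "$only_right"
printf 'in both:       %s\n' "$in_both"
```

Here -23 keeps only the first column, which is the combination the question needs.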

grep -o '^[^;]*' file2.txt just strips off everything after the first semicolon. You can use sed(1) for this, but if you are only extracting part of a line and not adding anything else, grep will often be faster.

comm needs its input files to be sorted, so sort is used to do that; the output will be sorted as well. sort uses locale-specific collation, and comm expects its input in that same order, so you may need to set LC_ALL=C for both commands, depending on the exact collation you want.
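As a sketch of how the collation setting fits in, the variant below exports LC_ALL=C so that sort and comm agree on byte order, and writes the result to a third file. It feeds the second input through standard input (comm accepts - as a file name), so it also runs in shells without process substitution, at the cost of one intermediate file. The sample data is taken from the question, assuming the www.other-domain entry was meant to end in .com; the directory and the result.txt name are placeholders.

```shell
# Use plain byte-order collation for every command in the pipeline.
export LC_ALL=C

# Sample data from the question (typo in file 2 assumed fixed).
work=$(mktemp -d)
printf '%s\n' www.example.com www.domain.com www.otherexample.com \
    www.other-domain.com www.other-example.com www.exa-ample.com \
    > "$work/file1.txt"
printf '%s\n' 'www.example.com;;;2' 'www.domain.com;;;5' \
    'www.other-domain.com;;;0' 'www.exa-ample.com;;;4' \
    > "$work/file2.txt"

# Sort file 1 once, then stream the stripped and sorted file 2 into comm.
sort "$work/file1.txt" > "$work/file1.sorted"
grep -o '^[^;]*' "$work/file2.txt" | sort |
    comm -23 "$work/file1.sorted" - > "$work/result.txt"

cat "$work/result.txt"
```

The result comes out in sorted order rather than in the original order of file 1.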

Note in your question you have www.other-domain in file 2, but www.other-domain.com in file 1. I have assumed that it is a typo in file 2 given the output.

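This runs all the processes in parallel and streams the file data through them, so grep and comm only hold a small part of the data at any time. sort is the exception: it has to read all of its input before it can write anything, and for inputs larger than its memory buffer GNU sort spills to temporary files (in $TMPDIR, or /tmp by default) and merges them. Memory use therefore stays bounded even for large files, but some scratch disk space may be needed.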
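If memory or scratch space turns out to matter for files of this size, sort can be told how much buffer to use and where to put its temporary files. This is a small sketch assuming GNU sort, where -S sets the buffer size and -T sets the temporary directory; the 64M value and the directory are arbitrary placeholders.

```shell
export LC_ALL=C

# Hypothetical scratch directory and a tiny stand-in for the real file.
scratch=$(mktemp -d)
printf '%s\n' www.example.com www.domain.com www.exa-ample.com \
    > "$scratch/file1.txt"

# -S caps the in-memory buffer, -T chooses where spill files go
# (GNU sort options; the values here are only examples).
sort -S 64M -T "$scratch" "$scratch/file1.txt" > "$scratch/file1.sorted"

cat "$scratch/file1.sorted"
```

On a three-line file nothing is spilled, of course; the options only start to matter once the input exceeds the buffer.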

Solution 2

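If the input in file2 contains a subset of the contents of file1, you could just strip the suffixes and use the result as a list of fixed-string, whole-line patterns to exclude (grep -F is the current spelling of fgrep, which newer GNU grep releases flag as obsolescent):

sed 's/;.*//' file2 | grep -Fvxf - file1 >not-in-file2

Note that grep has to load the whole pattern list into memory, so this is less suited to very large file2 inputs than the sorted approach.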
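A runnable sketch of this filter on the question's sample data follows, assuming the www.other-domain entry in file 2 was meant to end in .com. It uses grep -F, which is equivalent to fgrep; the temporary directory is a placeholder.

```shell
# Sample data from the question (typo in file 2 assumed fixed).
work=$(mktemp -d)
printf '%s\n' www.example.com www.domain.com www.otherexample.com \
    www.other-domain.com www.other-example.com www.exa-ample.com \
    > "$work/file1"
printf '%s\n' 'www.example.com;;;2' 'www.domain.com;;;5' \
    'www.other-domain.com;;;0' 'www.exa-ample.com;;;4' \
    > "$work/file2"

# -F: fixed strings, -x: match whole lines only, -v: invert the match,
# -f -: read the patterns from standard input.
sed 's/;.*//' "$work/file2" |
    grep -F -v -x -f - "$work/file1" > "$work/not-in-file2"

cat "$work/not-in-file2"
```

Unlike the comm approach, this needs no sorting and keeps the original order of file1. The -x flag matters: without it, the pattern www.example.com would not match www.otherexample.com here, but shorter patterns could match inside longer lines.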

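The same general idea can be applied to diff or comm. comm requires sorted input, but if that is not a problem (or if your data is sorted to start with), just preprocess the data from file2. Here file2.sorted must be in sorted order with respect to the text before the first semicolon, since stripping the suffix after sorting whole lines can change the order. With the stripped file2 on standard input as the first file, -13 keeps only the lines unique to file1.sorted (-12 would print the lines common to both instead):

sed 's/;.*//' file2.sorted | comm -13 - file1.sorted >cmp.out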

The constraint that the input must be sorted is what allows comm to handle really large files: it only needs to keep the current line of each file in memory at any one time. You could do the same with your own custom awk script.
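As an illustration of the custom awk idea, here is a minimal sketch of a merge over two sorted inputs that holds one line from each file at a time. It is not taken from the answer: the file names file1.sorted and file2.keys, the variable names, and the sample data (from the question, with the www.other-domain typo assumed fixed) are all placeholders, and LC_ALL=C is set so that sort and awk compare strings the same way.

```shell
export LC_ALL=C

# Sample data from the question, sorted; file2.keys holds the
# part of each file 2 line before the first semicolon.
work=$(mktemp -d)
printf '%s\n' www.example.com www.domain.com www.otherexample.com \
    www.other-domain.com www.other-example.com www.exa-ample.com |
    sort > "$work/file1.sorted"
printf '%s\n' 'www.example.com;;;2' 'www.domain.com;;;5' \
    'www.other-domain.com;;;0' 'www.exa-ample.com;;;4' |
    sed 's/;.*//' | sort > "$work/file2.keys"

# Walk both sorted files in step; print file 1 lines with no equal key.
awk -v keys="$work/file2.keys" '
BEGIN { have = ((getline k < keys) > 0) }
{
    # Advance through the key file until its current key is >= this line.
    # Appending "" forces string rather than numeric comparison.
    while (have && (k "") < ($0 ""))
        have = ((getline k < keys) > 0)
    # Print the line only if no equal key was found.
    if (!have || (k "") != ($0 ""))
        print
}' "$work/file1.sorted" > "$work/cmp.out"

cat "$work/cmp.out"
```

In practice comm already does this job; a hand-written merge is mainly useful when the comparison key or the output format needs more than comm offers.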

Admin
Author by

Admin

Updated on June 05, 2022

Comments

  • Admin
    Admin almost 2 years

    I have two files

    1st file is like this:

    www.example.com
    www.domain.com
    www.otherexample.com
    www.other-domain.com
    www.other-example.com
    www.exa-ample.com
    

    2nd file is like this (the numbers after ;;; are between 0 and 10):

    www.example.com;;;2
    www.domain.com;;;5
    www.other-domain;;;0
    www.exa-ample.com;;;4
    

    and I want to compare these two files and write the output to a third file like this:

    www.otherexample.com
    www.other-example.com
    

    Both files are large (over 500 MB).