Percentage value with GNU Diff

Solution 1

Something like this perhaps?

Two files, A1 and A2.

$ sdiff -B -b -s A1 A2 | wc -l

gives the number of lines that differ. wc -l A1 gives the total number of lines; divide the first count by the second.

The -b and -B options ignore changes in whitespace and blank lines, and -s suppresses the common lines.
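A minimal sketch of that division, assuming GNU sdiff and two line-oriented text files. The function name diff_percent is hypothetical, the percentage is taken relative to the line count of the first file, and integer arithmetic truncates the result.

```shell
#!/bin/bash
# diff_percent FILE_A FILE_B  (hypothetical helper, not part of diffutils)
# Prints the share of differing lines as an integer percentage of the
# number of lines in FILE_A.
diff_percent() {
    local a=$1 b=$2 changed total
    # -s suppresses common lines, so every output line is a difference;
    # -b and -B ignore whitespace changes and blank lines.
    changed=$(sdiff -B -b -s "$a" "$b" | wc -l)
    total=$(wc -l < "$a")
    # An empty first file would cause a division by zero.
    if [ "$total" -eq 0 ]; then
        echo 0
        return
    fi
    echo $(( 100 * changed / total ))
}
```

Called as diff_percent A1 A2, it prints a single integer. Lines that exist only in the second file also count as differences, so the result can exceed 100 when the second file is much longer.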

Solution 2

https://superuser.com/questions/347560/is-there-a-tool-to-measure-file-difference-percentage has a neat solution for this:

wdiff -s file1.txt file2.txt

For more options, see man wdiff.
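If only the summary is wanted, something along these lines may help. It is a sketch that assumes GNU wdiff is installed and that its -1, -2 and -3 options (suppress deleted, inserted and common words) behave as documented; the function name word_stats is hypothetical, and the grep pattern assumes the statistics lines contain the text "% common".

```shell
#!/bin/bash
# word_stats FILE_A FILE_B  (hypothetical helper, not part of wdiff)
# Prints only the per-file statistics lines produced by wdiff -s.
word_stats() {
    # -s requests statistics; -1 -2 -3 suppress the word-level diff.
    # grep keeps only the statistics lines (assumed output format).
    wdiff -s -1 -2 -3 "$1" "$2" 2>&1 | grep '% common'
}
```

Called as word_stats file1.txt file2.txt, it should print one statistics line per file, each with word-level percentages.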

Solution 3

Here's a script that will compare all .txt files and display the ones that have more than 15% duplication:

#!/bin/bash

# walk through all files in the current dir (and subdirs)
# and compare them with other files, showing percentage
# of duplication.

# which type of files to compare?
# (wouldn't make sense to compare binary formats)
ext="txt"

# support filenames with spaces:
IFS=$(echo -en "\n\b")

working_dir="$PWD"
working_dir_name=$(echo $working_dir | sed 's|.*/||')
all_files="$working_dir/../$working_dir_name-filelist.txt"
remaining_files="$working_dir/../$working_dir_name-remaining.txt"

# get information about files:
# list "size path" for every non-hidden file with the chosen extension,
# largest first
find . -type f -print0 | xargs -0 stat -c "%s %n" | grep -v "/\." | \
     grep "\.$ext\$" | sort -nr > "$all_files"

cp $all_files $remaining_files

while read string; do
    # strip the leading size field, keeping the "./path" part
    fileA=$(echo $string | sed 's/.[^.]*\./\./')
    tail -n +2 "$remaining_files" > $remaining_files.temp
    mv $remaining_files.temp $remaining_files
    # remove empty lines since they produce false positives
    sed '/^$/d' $fileA > tempA

    echo Comparing $fileA with other files...

    while read string; do
        fileB=$(echo $string | sed 's/.[^.]*\./\./')
        sed '/^$/d' $fileB > tempB
        A_len=$(wc -l < tempA)
        B_len=$(wc -l < tempB)
        # skip empty files to avoid dividing by zero
        [[ $B_len -eq 0 ]] && continue

        differences=$(sdiff -B -s tempA tempB | wc -l)
        common=$(expr $A_len - $differences)

        percentage=$(echo "100 * $common / $B_len" | bc)
        if [[ $percentage -gt 15 ]]; then
            echo "  $percentage% duplication in" \
                 "$(echo $fileB | sed 's|\./||')"
        fi
    done < "$remaining_files"
    echo " "
done < "$all_files"

# clean up; -f avoids errors when no files were compared
rm -f tempA tempB
rm -f "$all_files" "$remaining_files"

Author: cdated

I like shells.

Updated on April 20, 2022

Comments

  • cdated about 2 years

    What is a good method for using diff to show a percentage difference between two files?

    For example, if a file has 100 lines and a copy has 15 changed lines, the diff percentage would be 15%.

    • MJB about 14 years
      You could use sdiff and count the separators, then divide by the number of lines.
  • cdated about 14 years
    From the man page of wc: it prints newline, word, and byte counts. Divide the first number in the output by the number of lines in the file you are comparing. wc -l gives only the number of lines and can be added to the command above. (Response to AlligatorJack.)
  • MJB about 14 years
    @cdated: thanks for clarifying. I did not see the question/response, of course, until you commented.
  • Sridhar Sarnobat about 3 years
    I guess this only works for text, as opposed to media like videos, which I keep getting near-duplicates of (due to youtube-dl changing).