Similarity between two data sets or arrays

Solution 1

"Distance" or "similarity" could refer to this type of problem.

Simply calculating the sum of absolute differences, as you've done, should work fairly well. This is called the Manhattan distance. In mathematical terms, it would be: ∑_{x ∈ {A,B,C,D}} Abs(x1 - x2).
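
For instance, a minimal sketch in Python (the dict shape mirrors the sets in the question; manhattan_distance is just an illustrative name):

    def manhattan_distance(s1, s2):
        # Sum of absolute differences over the shared keys.
        return sum(abs(s1[k] - s2[k]) for k in s1)

    reference = {"A": 1, "B": 3, "C": 6, "D": 6}
    candidate = {"A": 2, "B": 3, "C": 6, "D": 6}
    print(manhattan_distance(reference, candidate))  # 1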

Although the best measure really depends on what behaviour you want.

Ratio could potentially be a better idea.

Consider something like (1000000, 5, 5, 5) versus (999995, 5, 5, 5) and (1000000, 0, 5, 5).

According to the above formula, the first would have the same similarity to both the second and the third.

If this is not desired (as 999995 can be considered pretty close to 1000000, while 0 can be thought of as quite far from 5), you should divide by the maximum of the two when calculating each distance.

∑_{x ∈ {A,B,C,D}} [ Abs(x1 - x2) / max(x1, x2) ]

This keeps each term between 0 and 1, and each term is essentially the percentage difference between the two values.

This means that, for the example above, we'd consider (1000000, 5, 5, 5) and (999995, 5, 5, 5) to be very similar (since the sum is |1000000 - 999995|/1000000 + 0 + 0 + 0 = 0.000005), while (1000000, 5, 5, 5) and (1000000, 0, 5, 5) would be considered much more different (since the sum is 0 + |5 - 0|/5 + 0 + 0 = 1).
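
As a rough sketch of that normalized version (assuming Python and non-negative values; normalized_distance is just an illustrative name):

    def normalized_distance(s1, s2):
        # Each term is Abs(x1 - x2) / max(x1, x2), i.e. the relative
        # difference, which lies between 0 and 1 for non-negative values.
        total = 0.0
        for k in s1:
            larger = max(s1[k], s2[k])
            if larger != 0:  # both values zero -> no difference for this key
                total += abs(s1[k] - s2[k]) / larger
        return total

    a = {"A": 1000000, "B": 5, "C": 5, "D": 5}
    b = {"A": 999995,  "B": 5, "C": 5, "D": 5}
    c = {"A": 1000000, "B": 0, "C": 5, "D": 5}
    print(normalized_distance(a, b))  # 5e-06
    print(normalized_distance(a, c))  # 1.0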

If negative values are possible, the formula would need to be updated appropriately. You'd need to decide how you want to handle that based on the problem you're trying to solve. Should 10 to 0 be more or less different than (or equivalent to) 5 to -5?

Are elements interchangeable to any degree?

Consider something like A=1, B=2, C=3, D=4 and A=4, B=1, C=2, D=3.

While every individual element has changed, the values are still 1, 2, 3, 4; each one has simply shifted one position (with the 4 wrapping around to the front).

For some problems this isn't going to matter at all and the above wouldn't be all that different than going from A=1, B=11, C=21, D=31 to A=2, B=12, C=22, D=32. For other problems it could be quite relevant though.

For a sequence like a string or array, the idea of inserting, deleting or shifting elements could make sense. If so, you would want to look at edit distance, a common one of which would be Levenshtein distance. You might also want to think about modifying this to consider how much individual values differ by (but this would not be trivial).
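
For illustration, a standard dynamic-programming sketch of Levenshtein distance in Python (it treats values as plain symbols and does not weight by how far apart two values are):

    def levenshtein(seq1, seq2):
        # Classic row-by-row DP over insertions, deletions and substitutions.
        prev = list(range(len(seq2) + 1))
        for i, x in enumerate(seq1, start=1):
            curr = [i]
            for j, y in enumerate(seq2, start=1):
                cost = 0 if x == y else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    # Delete the trailing 4 and insert one at the front: 2 edits.
    print(levenshtein([1, 2, 3, 4], [4, 1, 2, 3]))  # 2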

For something like a set, elements are interchangeable, but there wouldn't really be a strict order on the elements ({1, 2, 3} is the same as {3, 1, 2}). If this is the case, the simplest might be to sort the values and just use edit distance. You may also be able to loop through both at the same time in some way, which would allow you to more easily take the differences between values into account.
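
As a rough sketch of that sorted-values idea (assuming Python, numeric values and equal-sized collections; sorted_difference is just an illustrative name):

    def sorted_difference(values1, values2):
        # Sort both collections so order is ignored, then sum the
        # pairwise absolute differences.
        return sum(abs(x - y) for x, y in zip(sorted(values1), sorted(values2)))

    print(sorted_difference([1, 2, 3], [3, 1, 2]))   # 0
    print(sorted_difference([1, 2, 3], [1, 2, 10]))  # 7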

Solution 2

Your problem reminds me of finding a Hamming distance. Basically, the Hamming distance between two objects is the number of elements in one object that must be changed to make it match the other object. There are similar measures as well (Damerau–Levenshtein distance, Euclidean distance, etc.).

You have a number of choices in how you implement this. For instance, is the distance between {1,3,4} and {1,7,4} 1 (because one element changed) or 4 (because of the magnitude of the change)? How you actually define the distance depends a lot on the context of your problem, and there's not necessarily a right answer.
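
For illustration, a small Python sketch of the two interpretations (assuming equal-length sequences; the function names are made up):

    def positions_changed(s1, s2):
        # Hamming-style: count the positions that differ.
        return sum(1 for x, y in zip(s1, s2) if x != y)

    def total_change(s1, s2):
        # Magnitude-style: sum how much the values differ by.
        return sum(abs(x - y) for x, y in zip(s1, s2))

    print(positions_changed([1, 3, 4], [1, 7, 4]))  # 1
    print(total_change([1, 3, 4], [1, 7, 4]))       # 4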

Comments

  • Anders (almost 2 years ago)

    Let's say I have a dataset that looks like this:

    {A:1, B:3, C:6, D:6}
    

    I also have a list of other sets to compare against my specific set:

    {A:1, B:3, C:6, D:6},  
    {A:2, B:3, C:6, D:6},  
    {A:99, B:3, C:6, D:6},  
    {A:5, B:1, C:6, D:9},  
    {A:4, B:2, C:2, D:6}
    

    My entries could be visualized as a table with four columns: A, B, C, and D.

    How can I find the set with the most similarity? For this example, row 1 is a perfect match and row 2 is a close second, while row 3 is quite far away.

    I am thinking of calculating a simple delta, for example Abs(a1 - a2) + Abs(b1 - b2) + etc., and perhaps getting a correlation value for the entries with the best (smallest) deltas; a rough sketch of this idea is shown below.

    Is this a valid way? And what is the name of this problem?
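
    A rough Python sketch of that idea (the dict layout mirrors the rows above; delta is just an illustrative name):

        reference = {"A": 1, "B": 3, "C": 6, "D": 6}
        candidates = [
            {"A": 1,  "B": 3, "C": 6, "D": 6},
            {"A": 2,  "B": 3, "C": 6, "D": 6},
            {"A": 99, "B": 3, "C": 6, "D": 6},
            {"A": 5,  "B": 1, "C": 6, "D": 9},
            {"A": 4,  "B": 2, "C": 2, "D": 6},
        ]

        def delta(s1, s2):
            # Sum of absolute differences across the columns.
            return sum(abs(s1[k] - s2[k]) for k in s1)

        # Smallest delta = most similar row.
        best = min(candidates, key=lambda c: delta(reference, c))
        print(best)  # {'A': 1, 'B': 3, 'C': 6, 'D': 6}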