Hamming distance on binary strings in SQL

10,593

Solution 1

It appears that storing the data in a BINARY column is an approach bound to perform poorly. The only fast way to get decent performance is to split the content of the BINARY column in multiple BIGINT columns, each containing an 8-byte substring of the original data.

In my case (32 bytes) this would mean using 4 BIGINT columns and using this function:

CREATE FUNCTION HAMMINGDISTANCE(
  A0 BIGINT, A1 BIGINT, A2 BIGINT, A3 BIGINT, 
  B0 BIGINT, B1 BIGINT, B2 BIGINT, B3 BIGINT
)
RETURNS INT DETERMINISTIC
RETURN 
  BIT_COUNT(A0 ^ B0) +
  BIT_COUNT(A1 ^ B1) +
  BIT_COUNT(A2 ^ B2) +
  BIT_COUNT(A3 ^ B3);

Using this approach, in my testing, is over 100 times faster than using the BINARY approach.


FWIW, this is the code I was hinting at while explaining the problem. Better ways to accomplish the same thing are welcome (I especially don't like the binary > hex > decimal conversions):

CREATE FUNCTION HAMMINGDISTANCE(A BINARY(32), B BINARY(32))
RETURNS INT DETERMINISTIC
RETURN 
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 1,  8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 1,  8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 9,  8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 9,  8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 17, 8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 17, 8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 25, 8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 25, 8)), 16, 10)
  );

Solution 2

Interesting question, I've found a way to do this for a binary(3) that might work as well for a binary(32):

drop table if exists BinaryTest;
create table  BinaryTest (hash binary(3));
insert BinaryTest values (0xAAAAAA);

set @supplied = cast(0x888888 as binary);

select  length(replace(concat(
            bin(ascii(substr(hash,1,1)) ^ ascii(substr(@supplied,1,1))),
            bin(ascii(substr(hash,2,1)) ^ ascii(substr(@supplied,2,1))),
            bin(ascii(substr(hash,3,1)) ^ ascii(substr(@supplied,3,1)))
        ),'0',''))
from    BinaryTest;

The replace removes any all zeroes, and the length of remainder is the number of ones. (The conversion to binary omits leading zeroes, so counting the zeroes would not work.)

This prints 6, which matches the number of ones in

0xAAAAAA ^ 0x888888 = 0x222222 = 0b1000100010001000100010
Share:
10,593

Related videos on Youtube

CAFxX
Author by

CAFxX

Yak-shaving for a living

Updated on May 17, 2022

Comments

  • CAFxX
    CAFxX about 2 years

    I have a table in my DB where I store SHA256 hashes in a BINARY(32) column. I'm looking for a way to compute the Hamming distance of the entries in the column to a supplied value, i.e. something like:

    SELECT * FROM table 
      ORDER BY HAMMINGDISTANCE(hash, UNHEX(<insert supplied sha256 hash here>)) ASC 
      LIMIT 10
    

    (in case you're wondering, the Hamming distance of strings A and B is defined as BIT_COUNT(A^B), where ^ is the bitwise XOR operator and BIT_COUNT returns the number of 1s in the binary string).

    Now, I know that both the ^ operator and BIT_COUNT function only work on INTEGERs and so I'd say that probably the only way to do it would be to break up the binary strings in substrings, cast each binary substring to integer, compute the Hamming distance substring-wise and then add them. The problem with this is that it sounds terribly complicated, not efficient and definitely not elegant. My question therefore is: could you suggest any better way? (please note that I'm on shared hosting and therefore I can't modify the DB server or load libraries)

    edit(1): Obviously loading the whole table in PHP and doing the computations there would be possible but I'd rather avoid it because this table will probably grow quite large.

    edit(2): The DB server is MySQL 5.1

    edit(3): My answer below contains the code that I just described above.

    edit(4): I just found out that using 4 BIGINTs to store the hash instead of a BINARY(32) yields massive speed improvements (more than 100 times faster). See the comments to my answer below.

    • CAFxX
      CAFxX over 13 years
      Feel free also to suggest different ways to store the hashes if this could prove useful in finding a better solution.
    • Andomar
      Andomar over 13 years
      If you'd store the hash in 8 integers (perhaps in addition to the binary storage), the calculation becomes much easier.
    • Nanne
      Nanne over 13 years
      I'm really curious as for why you would want to calculate the distance :)
    • CAFxX
      CAFxX over 13 years
      I'm working on a distributed hash table, and that step is needed to find keys on other nodes (i.e. if you don't have a key locally, you forward the query to the nodes whose ID is closer to the key - where closer is measured in terms of the Hamming distance); obviously both IDs and keys are of the same length (256 bit in my case).
    • pascal
      pascal over 13 years
      What do you mean by "the table will probably grow quite large"? Will a full scan with the adequate function be acceptable, or do you need a way to use some indexes and avoid the full scan?
    • CAFxX
      CAFxX over 13 years
      I mean that I'd rather avoid fetching the whole table in PHP and then choosing there the 10 entries I need (see the "LIMIT 10" clause in the code above) if I can do it on the SQL server.
    • TomSawyer
      TomSawyer over 10 years
      i'm trying to do hamming distance on mysql with phasher. i just tried this query but it seems not return nearest record with similar signature. can you explain it?
    • CAFxX
      CAFxX over 10 years
      @TomSawyer I'm not psychic, so I can't tell what's wrong in your code without looking. In my case it worked pretty well (see my answer below).
    • Philippe Ombredanne
      Philippe Ombredanne over 6 years
      See also this related answer: you could expand on it to make all the computations in the DB as functions stackoverflow.com/a/47487949/302521
  • CAFxX
    CAFxX over 13 years
    I just ran some tests: running the query in the original question using the function as defined here on a table with 100000 rows takes somewhere around 2.5s. Since I don't really need the exact answer to my query, I can just sample the table by adding a WHERE RAND() < 0.05 (to sample a random 5% of the table) and this brings the time down to 0.2s. Still, if some SQL guru out there can point out a better way to do it, I'd love to hear it.
  • CAFxX
    CAFxX over 13 years
    Other test: I created a view that trasnforms each BINARY(32) to four BIGINTs. This lowers the time from 2.5s to 0.6s.
  • CAFxX
    CAFxX over 13 years
    OK, I found out that if I actually use a table where I store the hash as 4 BIGINTs, the same query completes in 0.02s. Defintely using BINARY(32) is a Bad Idea(TM).
  • TomSawyer
    TomSawyer over 10 years
    hi there , i'm trying to do hamming distance with mysql too. how exactly query to search relevance records with hash string?
  • whoughton
    whoughton over 9 years
    Did you ever try using the BIT column type? dev.mysql.com/doc/refman/5.0/en/bit-type.html
  • Philippe Ombredanne
    Philippe Ombredanne over 6 years
    Splitting the binary in ints has also other interesting applications for speeding things up even further when you want to find hamming distance less than a certain value. See stackoverflow.com/a/47487949/302521