SQL way to get the MD5 or SHA1 of an entire row


Solution 1

You could calculate the HASHBYTES value for the entire row in an update trigger. I used this as part of an ETL process that previously compared every column in the tables, and the speed increase was huge.

HASHBYTES works on varchar, nvarchar, or varbinary datatypes. I wanted to compare integer keys and text fields, and casting everything would have been a nightmare, so I used the FOR XML clause in SQL Server as follows:

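CREATE TRIGGER get_hash_value ON staging_table
FOR UPDATE, INSERT AS
-- Rehash only the rows touched by this statement instead of the whole table.
-- "id" is an assumed key column; substitute the table's real key.
UPDATE s
SET sha1_hash = hashbytes('sha1', (SELECT s.col1, s.col2, s.col3 FOR XML RAW))
FROM staging_table AS s
JOIN inserted AS i ON i.id = s.id;
GO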
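
To make the ETL comparison concrete, here is a minimal sketch of how the stored hashes might be used to find new or changed rows. The target_table name and the id key column are assumptions for illustration only; they do not come from the original answer.

```sql
-- Assumed names: target_table holds the previously loaded rows,
-- and both tables share a key column "id" and a sha1_hash column.
SELECT s.id
FROM staging_table AS s
LEFT JOIN target_table AS t
       ON t.id = s.id
WHERE t.id IS NULL                  -- row does not exist in the target yet
   OR t.sha1_hash IS NULL           -- target row was never hashed
   OR t.sha1_hash <> s.sha1_hash;   -- at least one hashed column differs
```

This compares one binary value per row instead of every column, which is where the speed-up comes from.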

Alternatively, if you plan to do many updates on all the rows, you could calculate the values in a similar way outside of a trigger, again using a subquery with the FOR XML clause. If you go this route, you can even change it to SELECT *, but not in the trigger: there, each run would produce a different value, because the sha1_hash column itself would be part of the hashed input and would differ each time.

You could also modify the SELECT statement to hash more than one row.
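
As a rough sketch of the multi-row variant, the query below computes one hash per calendar month by serializing all of that month's rows, in a fixed order, with FOR XML RAW. The event_time column is a hypothetical timestamp column, and col1, col2, col3 stand in for the real column list as in the trigger above.

```sql
-- Assumed column: event_time (datetime) identifies the month a row belongs to.
SELECT m.month_start,
       HASHBYTES('SHA1',
           (SELECT s.col1, s.col2, s.col3
            FROM staging_table AS s
            WHERE s.event_time >= m.month_start
              AND s.event_time <  DATEADD(month, 1, m.month_start)
            ORDER BY s.col1, s.col2, s.col3   -- fixed order so the hash is repeatable
            FOR XML RAW)) AS month_hash
FROM (SELECT DISTINCT
             DATEADD(month, DATEDIFF(month, 0, event_time), 0) AS month_start
      FROM staging_table) AS m;
```

Note that on SQL Server 2014 and earlier, HASHBYTES input is limited to 8000 bytes, so a whole month of rows will usually only fit on later versions.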

Solution 2

In MSSQL, you can use HashBytes across the entire row by using XML:

SELECT MBT.id,
   hashbytes('MD5',
               (SELECT MBT.*
                FROM (
                      VALUES(NULL))foo(bar)
                FOR xml auto)) AS [Hash]
FROM <Table> AS MBT;

You need the FROM (VALUES(NULL)) foo(bar) clause in order to use FOR XML AUTO; it serves no other purpose.

Author: Pierre D

...from banging assembly code in the early eighties to crunching petabytes nowadays, I love Computer Science and things that go fast.

Updated on July 19, 2022

Comments

  • Pierre D almost 2 years

    Is there a "semi-portable" way to get the md5() or the sha1() of an entire row? (Or better, of an entire group of rows ordered by all their fields, i.e. order by 1,2,3,...,n)? Unfortunately not all DBs are PostgreSQL... I have to deal with at least Microsoft SQL Server, Sybase, and Oracle.

    Ideally, I'd like to have an aggregator (server side) and use it to detect changes in groups of rows. For example, in tables that have some timestamp column, I'd like to store a unique signature for, say, each month. Then I could quickly detect months that have changed since my last visit (I am mirroring certain tables to a server running Greenplum) and re-load those.

    I've looked at a few options, e.g. checksum(*) in T-SQL (horror: it's very collision-prone, since it's based on a bunch of XORs and 32-bit values), and hashbytes('MD5', field), but the latter can't be applied to an entire row. And that would give me a solution for just one of the SQL flavors I have to deal with.

    Any idea? Even for just one of the SQL idioms mentioned above, that would be great.