# Binary checksum to prevent duplicate rows

Greetings

I have used `BINARY_CHECKSUM` to check for possible existing duplicate data. Is there a more robust method of doing this, or would a simple `NOT IN` work as well as `BINARY_CHECKSUM`?

Here is a sample that uses the `BINARY_CHECKSUM` check:

```sql
with uno as
(
    select N'የተሰናባቹ ፕሬዚዳንት ዶናልድ ትረምፕ 800 የሚሆኑ ደጋፊዎች' as geez
), dos as
(
    select N'የተሰናባቹ ፕሬዚዳንት ዶናልድ ትረምፕ 800 የሚሆኑ ደጋፊዎች።' as geez
)
select BINARY_CHECKSUM(uno.geez),
       BINARY_CHECKSUM(dos.geez),
       case when BINARY_CHECKSUM(uno.geez) <> BINARY_CHECKSUM(dos.geez) then 'true'
            else 'false'
       end as diff
from uno
cross apply dos
```
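For reference, the `NOT IN` alternative I'm weighing would look something like this (the table and column names here are hypothetical, just to show the shape of the query):

```sql
-- Hypothetical staging and target tables, for illustration only.
-- Insert only the staging rows whose text is not already in the target.
insert into dbo.Articles (geez)
select s.geez
from dbo.Staging as s
where s.geez not in (select a.geez from dbo.Articles as a);
```

One caveat with `NOT IN`: if the subquery can return a NULL, no rows will match at all, so `NOT EXISTS` is usually the safer form of the same check.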

Thanks

Comparing character data for uniqueness can depend on which collation you're using. If it's case-insensitive, accent-insensitive, etc., you might get equalities that `BINARY_CHECKSUM` would not.
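A quick sketch of the difference: under a case-insensitive collation the two strings below compare as equal, while `BINARY_CHECKSUM` (which works on the raw bytes) returns different values for them. The specific collation name is just one common example.

```sql
declare @a nvarchar(20) = N'Hello';
declare @b nvarchar(20) = N'HELLO';

-- Equal under a case-insensitive collation...
select case when @a = @b collate Latin1_General_CI_AS
            then 'equal under CI collation'
            else 'not equal'
       end as collation_compare,
       -- ...but the checksums differ, because the bytes differ.
       BINARY_CHECKSUM(@a) as checksum_a,
       BINARY_CHECKSUM(@b) as checksum_b;
```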

Some of the characteristics are described here:

I have had some experience with unwanted duplicates/collisions using `BINARY_CHECKSUM` that I did not get using the `CHECKSUM()` function, which I could not explain but nevertheless had to deal with. If you want to reduce the chance of collisions with either of these functions, consider salting them: append or prepend some fixed data, store two checksums (one salted, one unsalted), and compare both of them.
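A minimal sketch of that salted/unsalted idea (the salt value here is arbitrary):

```sql
declare @val  nvarchar(100) = N'some row data';
declare @salt nvarchar(10)  = N'#salt#';  -- arbitrary fixed prefix

-- Store and compare both checksums. A false duplicate now requires
-- BOTH values to collide at once, which is far less likely than a
-- single collision.
select BINARY_CHECKSUM(@val)         as unsalted_checksum,
       BINARY_CHECKSUM(@salt + @val) as salted_checksum;
```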

There is also the HASHBYTES function:

That offers a larger return value, depending on the algorithm you choose, and is less likely to encounter collisions. Unfortunately it doesn't support the `CHECKSUM(*)` format; if you want to hash multiple columns, you have to concatenate them with some kind of delimiter. `HASHBYTES` is also limited in the data types it can accept, so you may have to convert values to character or binary before they will hash.
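For example, a multi-column hash might look like this (`SHA2_256` shown; the delimiter is an assumption and should be something unlikely to occur in the data):

```sql
declare @col1 nvarchar(50) = N'abc';
declare @col2 int          = 42;

-- HASHBYTES accepts character or binary input, so non-string columns
-- must be converted first, then joined with a delimiter.
select HASHBYTES('SHA2_256',
                 @col1 + N'|' + convert(nvarchar(20), @col2)) as row_hash;
```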


Thank you for the idea of salting, Robert. That never occurred to me.

What I have done in the past is to compute both binary_checksum and hashbytes (after converting and concatenating for multiple columns). That conversion and concatenation is not a pleasant process, especially when high precision decimals and floating point numbers are involved and you want to detect small changes in those numbers.
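To illustrate that conversion pain with high-precision numerics, here is one style I've used; the format choices are assumptions, but style 3 for `float` conversion keeps 17 significant digits, so tiny changes in the value still change the hash input:

```sql
declare @d decimal(28, 10) = 123.4567890123;
declare @f float           = 1.0000000000000002;

-- Style 3 gives a lossless 17-digit rendering of the float, so small
-- differences are not silently rounded away before hashing.
select HASHBYTES('SHA2_256',
                 convert(nvarchar(40), @d)
                 + N'|' + convert(varchar(30), @f, 3)) as row_hash;
```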
