Compare 2 big dataset for all rows and columns

SQLDUMMY · October 11, 2022, 9:24pm

Hi ,
I have table A with AUDIT_ID 1 &AUDIT_ID 2 that contains over 300 columns and 3683508 rows. I want to compare all data in table A with AUDIT_ID 1 and data in table A with AUDIT_ID 2. How do I compare values for each row and column? Not able to perform the below query due to the large dataset.

select *
from table a
where idp_audit_id=1
minus
select *
from table a
where idp_audit_id=2

Thanks.

robert_volk · October 11, 2022, 9:34pm

Are you getting an error when you try to run the query? Or does it time out? What specifically is keeping you from running it?
SQLTeam.com is a Microsoft SQL Server site, it looks like you're using Oracle, so our advice may not work for you.

SQLDUMMY · October 11, 2022, 9:52pm

It’s taking a long time and it shut down…

robert_volk · October 12, 2022, 10:38am

The query you provided is correct, whatever the problem is seems to be due to the size of the data it's comparing. One thing I missed earlier, if you include idp_audit_id in the SELECT list, then you'll get all the rows for each value, and it's not actually comparing them.

Some thoughts on how to proceed:

Try SELECTing only a few columns instead of SELECT *, this should lower memory usage. Make sure to exclude idp_audit_id from the SELECT clause, but keep the same WHERE clauses. If this works, you can run it again with a different set of columns. You will also need to exclude any column or columns that make up a primary key or unique constraint.
If Oracle has some kind of hashing function that can apply to all of the columns, you can use it to hash a new value that in essence combines all that data into one column, but is smaller to process. You must exclude all of the column(s) of the primary key or other unique constraints. They will make the hash unique as well, and won't properly compare.

harishgg1 · October 12, 2022, 10:57am

hi

one idea is indexes

another idea is splitting the data into small parts .. do your work ... put them back together

SQLDUMMY · October 12, 2022, 11:37am

Thanks. How to use the hashing function?

robert_volk · October 12, 2022, 1:31pm

In SQL Server:

SELECT HASHBYTES('SHA2_256', CONCAT(col1, '`', col2, '`', col3))
FROM table where idp_audit_id=1
EXCEPT -- same as MINUS
SELECT HASHBYTES('SHA2_256', CONCAT(col1, '`', col2, '`', col3))
FROM table where idp_audit_id=2

You'd combine/concatenate columns with a separator character (I'm using ` in the example) and then hash that combination. The idea is that if the rows have the same data, they'll generate the same hash. If the hashes are different, you can then go back and compare individual columns. The idea is to reduce the number of rows where you have to compare each column, since you're currently running into problems with the existing volume.

Edit: I mentioned earlier about excluding key columns, that applies to the hash function as well, but you'll also need to include some non-key columns in the SELECT clause that are not hashed. Sorry if I'm not explaining this well.