Compare two large datasets

We have two tables with about 50 million records (one in OLTP the other in the DWH). There are about 2000 records less in one table. We try to find out what the differences are, maybe that will help with the debugging.

We had a discussion how to find out which records are in table A and not in table B and visa versa. FULL OUTER JOIN, EXCEPT, ...

How would you do it?

Please post the columns of the tables you are using to do your comparisons

Hi Yosiasz,

CREATE TABLE dbo.attendances(
	id int IDENTITY(1,1) NOT NULL,
	lesson_id int NULL,
	student_id int NULL,
	code nvarchar(255) NULL,
	created_at datetime NULL,
	updated_at datetime NULL,
	external_reference nvarchar(255) NULL,
	reference int NULL,
	origin nvarchar(255) NULL,
	master nvarchar(255) NULL,
	identity_hash nvarchar(255) NULL,
	request_hash nvarchar(255) NULL,
	attributes_hash nvarchar(255) NULL,
	unique_reference nvarchar(255) NULL,
	subscription_id int NULL,
	centre_id int NOT NULL,
	destroyable bit NOT NULL
)

You may assume that the schema of the corresponding table in the DWH is the same.

Schools sometimes UPDATE a record with another centre, course, student, ... While the designers of the DWH expected the schools would only update the attendance code.

What key(s) indicate corresponding rows? Is it id or something else?

1 Like

how can the schema be the same in both oltp and DW, that just makes no sense. if that is the case what is the use of a DW.
@Wim_Leys please answer the question @ScottPletcher and I asked you if you need us to answer properly. We could guess and provide you some bootleg answer which you will try and come back and ask again and again for trial and error.

Hi Yosiasz,

The schemas indeed are not the same. I just wanted to simplify the whole thing as I was expecting a bunch of additional questions about the dimension keys, is_deleted flag, is_current, ... etc.

So maybe just forget we have to compare an OLTP table with a DWH table.
Suppose we have two identical (schema) tables with millions of rows: table A is the master of table B. They should be totally equal, but we know there is

  • a difference in the number of records (records that only occur in table A and maybe other records that only occur in table B (due to a delete in table A that was not well reflected in table B))
  • and we also expect that for records with the same id, the master and slave record may have different values in some columns. (due to updating not only the attendance code, but also pass an entirely different student, course, lesson, ...)
id int IDENTITY(1,1) NOT NULL

is the primary key and can (should) be used to find the corresponding record in the other table.

If the tables would not host millions of records, I would use a FULL OUTER JOIN or an EXCEPT to find differences. I wonder what the most efficient way is in this situation.

I've done this many moons ago:) I used checksum - if they were the same it matched. But I believe that your best bet would be EXCEPT
SELECT * from tableA
EXCEPT
SELECT * from tableB

1 Like

That would show records in tableA that are not in tableB. The OP believes there also may be records in tableB that are not in tableA.

Then add a UNION ALL SELECT * FROM tableB EXECPT SELECT * FROM tableA

PurpleLady, thank you for your answer!

The first thing I'd do is to find out what went wrong because, unless you fix that, the problem will occur again... and again... and again.

Also, EXCEPT is actually quite expensive and relatively slow. Consider making a hash of each row and comparing those instead. Making the can hash is surprisingly fast although it does have the disadvantage of not being able to tell you what was different by column. Of course, EXCEPT has a similar issue.

Once you get the hash down, consider leaving it as a part of the tables so that you can easily do checks on the tables to more easily verify that nothing has gone wrong again in the future... especially if you haven't figured out what went wrong this first time.

As a bit of a sidebar, I'd also trade in the "IsActive" column for a nice SCD Type 6 setup but without the "Current Flag". Just use the dates. That will also allow you determine the condition of data at virtually any given point of time. SCD Type 6 is also the fundamental idea behind Temporal Tables in SQL Server (starting in 2016).

I need to know more about this vs isActive. Are you talking for the data warehouse side?

Yes but it also works great for audit tables whether you use something homegrown to accomplish SCD Type 6 (or just type 2 can work a treat without a history table) or Temporal Tables.

SQL Data Compare is a great tool for doing this. Also Microsoft Visual Studio SQL Server Data Tools has a Data Compare function.

I'm not so sure about SQL Data Compare or the Visual Studio tools... I believe they have to be identical tables and, although the OP said that we should "pretend" they are to keep the post simple, they are not actually identical tables.

@yosiasz,

Just a bit more compared to the "IsActive" column...

SCDs of the nature that I'm talking about have a start and end date for every row. The currently active row has a date in the year '9999' for an end date. That makes life REAL easy because you no longer need to do a search for NULL in dates combined with non-nulls and you don't need to mess up indexes with low cardinality junk like an "IsActive" column, which can also cause a lot of data movement in indexes when a bunch of rows change their state.

1 Like

Hi Jeff,

Thank you for the feedback.

The DWH is modelled with SCD Type 2 tables.

We planned to do the compare this week. I'll try both the EXCEPT and FULL OUTER JOIN variants.

For a one-off, either of those will probably be fine but, if you need to do such a comparison on a regular basis, you might want to consider the hashing I spoke of. I took a comparison of two 500K row tables down from several minutes using EXCEPT to just under 6 seconds using the hash method. It takes a bit of setup work but it was well worth it.

Hi Jeff,

We run a "consistency" check after each and every ETL run that compares the number of records in certain tables between the OLTP and the DWH. That's how we found this anomaly. Comparing two large datasets is not a daily chore.
We are used to use hashing as a standard method to compare records. Setting that up won't be a problem.
I hope the differences between the tables will give us a clue where the bug is located.

Thank you for the help.