SQLTeam.com | Weblogs | Forums

Compare two large datasets

We have two tables with about 50 million records each (one in the OLTP system, the other in the DWH). One table has about 2,000 fewer records than the other. We are trying to find out what the differences are; maybe that will help with the debugging.

We had a discussion about how to find out which records are in table A but not in table B, and vice versa: FULL OUTER JOIN, EXCEPT, ...

How would you do it?

Please post the columns of the tables you are using to do your comparisons.

Hi Yosiasz,

CREATE TABLE dbo.attendances(
	id int IDENTITY(1,1) NOT NULL,
	lesson_id int NULL,
	student_id int NULL,
	code nvarchar(255) NULL,
	created_at datetime NULL,
	updated_at datetime NULL,
	external_reference nvarchar(255) NULL,
	reference int NULL,
	origin nvarchar(255) NULL,
	master nvarchar(255) NULL,
	identity_hash nvarchar(255) NULL,
	request_hash nvarchar(255) NULL,
	attributes_hash nvarchar(255) NULL,
	unique_reference nvarchar(255) NULL,
	subscription_id int NULL,
	centre_id int NOT NULL,
	destroyable bit NOT NULL
)

You may assume that the schema of the corresponding table in the DWH is the same.

Schools sometimes UPDATE a record with another centre, course, student, etc., while the designers of the DWH expected the schools to update only the attendance code.

What key(s) indicate corresponding rows? Is it id or something else?

1 Like

How can the schema be the same in both the OLTP system and the DWH? That just makes no sense. If that is the case, what is the use of a DWH?
@Wim_Leys please answer the question @ScottPletcher and I asked you if you want a proper answer. Otherwise we can only guess and provide some makeshift answer that you will try, then come back and ask again and again in a cycle of trial and error.

Hi Yosiasz,

The schemas indeed are not the same. I just wanted to simplify the whole thing as I was expecting a bunch of additional questions about the dimension keys, is_deleted flag, is_current, ... etc.

So maybe just forget we have to compare an OLTP table with a DWH table.
Suppose we have two identical (schema) tables with millions of rows: table A is the master of table B. They should be totally equal, but we know there is

  • a difference in the number of records: records that occur only in table A, and maybe other records that occur only in table B (due to a delete in table A that was not correctly reflected in table B)
  • records with the same id where the master and slave rows have different values in some columns (because an update changed not only the attendance code but also passed an entirely different student, course, lesson, ...)
id int IDENTITY(1,1) NOT NULL

is the primary key and can (should) be used to find the corresponding record in the other table.

If the tables did not hold millions of records, I would use a FULL OUTER JOIN or an EXCEPT to find the differences. I wonder what the most efficient way is at this scale.
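For reference, a minimal sketch of the FULL OUTER JOIN variant, assuming both tables share the id primary key (the names dbo.tableA and dbo.tableB are placeholders, not the actual tables):

```sql
-- Find ids that exist in only one of the two tables,
-- joining on the shared primary key.
SELECT COALESCE(a.id, b.id) AS id,
       CASE WHEN b.id IS NULL THEN 'only in A'
            WHEN a.id IS NULL THEN 'only in B'
       END AS side
FROM dbo.tableA AS a
FULL OUTER JOIN dbo.tableB AS b
    ON a.id = b.id
WHERE a.id IS NULL
   OR b.id IS NULL;
```

With an index on id on both sides, this resolves to a merge join and avoids comparing every column.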

I've done this many moons ago. :) I used CHECKSUM: if the checksums were the same, the rows matched. But I believe that your best bet would be EXCEPT:
SELECT * FROM tableA
EXCEPT
SELECT * FROM tableB

1 Like

That would show records in tableA that are not in tableB. The OP believes there may also be records in tableB that are not in tableA.

Then add a UNION ALL SELECT * FROM tableB EXCEPT SELECT * FROM tableA.
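Putting both directions together might look like the sketch below (table names are placeholders). In T-SQL, EXCEPT and UNION ALL at the same level are evaluated left to right, so the derived tables make the intended grouping explicit:

```sql
-- All rows that appear in exactly one of the two tables,
-- tagged with the side they came from.
SELECT 'only in A' AS side, x.*
FROM (SELECT * FROM dbo.tableA
      EXCEPT
      SELECT * FROM dbo.tableB) AS x
UNION ALL
SELECT 'only in B' AS side, y.*
FROM (SELECT * FROM dbo.tableB
      EXCEPT
      SELECT * FROM dbo.tableA) AS y;
```

Note that a row updated in place shows up twice here (once per side), which is actually useful for spotting the column-level differences the OP expects.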

PurpleLady, thank you for your answer!

The first thing I'd do is to find out what went wrong because, unless you fix that, the problem will occur again... and again... and again.

Also, EXCEPT is actually quite expensive and relatively slow. Consider making a hash of each row and comparing those instead. Computing the hash is surprisingly fast, although it does have the disadvantage of not being able to tell you which column was different. Of course, EXCEPT has a similar issue.

Once you get the hash down, consider leaving it as a part of the tables so that you can easily do checks on the tables to more easily verify that nothing has gone wrong again in the future... especially if you haven't figured out what went wrong this first time.
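One possible way to persist such a hash, sketched against the attendances table posted earlier. The column list is just a sample, and CONCAT_WS requires SQL Server 2017 or later; the dwh schema name is illustrative:

```sql
-- Persist a per-row hash so future comparisons only need to
-- join on id and compare the hashes.
-- Caveat: CONCAT_WS skips NULLs, so in practice you may want
-- ISNULL(...) wrappers to avoid hash collisions between
-- (1, NULL) and (NULL, 1).
ALTER TABLE dbo.attendances ADD row_hash AS
    HASHBYTES('SHA2_256',
        CONCAT_WS('|', lesson_id, student_id, code, centre_id,
                  subscription_id, destroyable)) PERSISTED;

-- Rows whose id matches but whose content differs:
SELECT a.id
FROM dbo.attendances AS a
JOIN dwh.attendances AS b
    ON b.id = a.id
WHERE a.row_hash <> b.row_hash;
```

Only deterministic expressions can be PERSISTED, which is why the sample hashes int, nvarchar, and bit columns directly but leaves out the datetime columns (an implicit datetime-to-string conversion is not deterministic).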

As a bit of a sidebar, I'd also trade in the "IsActive" column for a nice SCD Type 6 setup but without the "Current Flag". Just use the dates. That will also allow you to determine the condition of the data at virtually any given point in time. SCD Type 6 is also the fundamental idea behind Temporal Tables in SQL Server (starting in 2016).

I need to know more about this vs isActive. Are you talking for the data warehouse side?

Yes, but it also works great for audit tables, whether you use something homegrown to accomplish SCD Type 6 (plain Type 2 can also work a treat without a history table) or Temporal Tables.

SQL Data Compare is a great tool for doing this. Also Microsoft Visual Studio SQL Server Data Tools has a Data Compare function.

I'm not so sure about SQL Data Compare or the Visual Studio tools... I believe they have to be identical tables and, although the OP said that we should "pretend" they are to keep the post simple, they are not actually identical tables.

@yosiasz,

Just a bit more compared to the "IsActive" column...

SCDs of the nature that I'm talking about have a start and end date for every row. The currently active row has a date in the year '9999' for an end date. That makes life REAL easy because you no longer need to do a search for NULL in dates combined with non-nulls and you don't need to mess up indexes with low cardinality junk like an "IsActive" column, which can also cause a lot of data movement in indexes when a bunch of rows change their state.
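As an illustration of that pattern (the table, column, and variable names here are hypothetical), finding the row that was current at a given moment needs nothing more than a range check:

```sql
-- Row versions carry start_date / end_date; the currently active
-- version has end_date = '9999-12-31', so no NULL checks and no
-- low-cardinality IsActive flag are needed.
SELECT *
FROM dwh.attendances_scd
WHERE student_id = @student_id
  AND @as_of >= start_date
  AND @as_of <  end_date;  -- current row: end_date = '9999-12-31'
```

With @as_of set to the current date this returns the active version; with a historical date it returns the version that was in effect then, which is exactly what Temporal Tables do under the covers.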

1 Like

Hi Jeff,

Thank you for the feedback.

The DWH is modelled with SCD Type 2 tables.

We planned to do the compare this week. I'll try both the EXCEPT and FULL OUTER JOIN variants.

For a one-off, either of those will probably be fine but, if you need to do such a comparison on a regular basis, you might want to consider the hashing I spoke of. I took a comparison of two 500K row tables down from several minutes using EXCEPT to just under 6 seconds using the hash method. It takes a bit of setup work but it was well worth it.

Hi Jeff,

We run a "consistency" check after each and every ETL run that compares the number of records in certain tables between the OLTP and the DWH. That's how we found this anomaly. Comparing two large datasets is not a daily chore.
We are used to using hashing as a standard method to compare records. Setting that up won't be a problem.
I hope the differences between the tables will give us a clue where the bug is located.

Thank you for the help.