Quickest way to Paginate Data?

Honestly though... how common are these false positives (empirical data, please, showing test results)? And even in that specific case (Elasticsearch), what if a couple of delta rows make it to ES?

si? oui? Ja?

To be sure, I do agree that false positives using checksums, especially across many columns, are going to be a rare event. That notwithstanding, the key here is that it doesn't matter how often false positives can happen. The key is what effect they have if they do occur. Will it lead to an error on a report? Will it cause someone to make a bad decision? Will it affect a customer?

Checksums should never be used to prove that a duplicate exists. They should only be used to prove that a duplicate does not exist. If a checksum indicates a duplicate, then a separate, more expensive method should be used against (only) those items/rows that have matching checksums to determine whether a duplicate actually does exist.

Here's a simple example of how CHECKSUM can cause a false positive. Notice that the values aren't really that different, and having similar data present can certainly increase the chances of false positives.

 SELECT  CS1 = CHECKSUM('A352KD')
        ,CS2 = CHECKSUM('A352NT')
;

If duplicate data is a bad thing then, empirical data or not, the code must prevent it.


They do actually exist in the real world - if you compare a reasonably sized database and rely on CHECKSUM to find all "different rows", you will miss some. Whether that matters is a case-by-case thing, of course, but for me, if we are synchronising two databases then I want 100%.

What might well be worthwhile is to use CHECKSUM daily to find differences and then do a whole-database table compare at the weekend to find any that were missed.
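
For the weekend pass, something along these lines would do the brute-force compare - the database and table names are just placeholders, and it assumes the two tables have identical column lists:

-- Rows that are missing from, or different in, the target (illustrative names)
SELECT * FROM SourceDB.dbo.Customer
EXCEPT
SELECT * FROM TargetDB.dbo.Customer
;

Running it the other way round catches rows that exist only in the target, and EXCEPT treats NULLs as equal, which is handy for this sort of comparison.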

We get some that are missed anyway :frowning: and I'd love to have a better way of "finding" those. Example:

Let's say that we require that the Country Code on a Customer Account is valid, and the Accounts Package we are importing from is JUNK and doesn't care.

So we reject a Customer Record Import, because Country Code is invalid (and log a suitable error which the user will see & sort out).

We also reject all the brand-new transactions for that new Customer (because the Customer Account does not exist, natch!). We log that error too.

The user fixes the Customer Record with a valid Country Code, and it gets a "Now" ChangeDateTime.

Next import loads the missing Customer Record because it has a "new" ChangeDateTime.

However, the transactions are still missing, because none of them have a "new" ChangeDateTime :frowning:

I don't have an answer to this (other than to "force" all tables, all rows, periodically - but that isn't scalable ...)

My only other thought is that, because our Data Import Error Logging table contains the Table Name and PKey of the "faulty" records, perhaps the Import routines should do both "All rows with ChangeDate after @LastTime" and "Any PKey in the Error Logging table".
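
A rough sketch of what that combined pull might look like - all of the table, column and variable names here are illustrative rather than from the actual system:

DECLARE @LastTime datetime2(3) = '2017-02-07 09:05:41.320';

SELECT c.*
FROM dbo.Customer c
WHERE c.ChangeDateTime > @LastTime          -- normal "changed since last run" delta
   OR c.PKey IN                             -- plus retry of previously rejected rows
      (
          SELECT e.PKey
          FROM dbo.ImportErrorLog e
          WHERE e.TableName = 'Customer'
      )
;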

Controversial point:

We have an IDENTITY column (and Unique Index / PKey; sometimes it's the Clustered Index, but we don't do "can't be bothered, make it the Clustered PKey anyway" :slight_smile: )

The presence of a single-column, narrow key (choose either INT or BIGINT depending on Project Size) means that we can easily "reference" any row, in any table - e.g. in our Error Logging table - without having to worry about multi-part keys on tables which are made up of X datatypes and Y-width columns.

So we have IDENTITY in every table, whether it's needed or not, because sooner or later there is some sort of cleanup or cross-reference or somesuch where it is much easier either to reference the rows with a simple INT or to mechanically generate code that does something useful and only has to worry about being able to reference the rows, uniquely, with a single INT value and not a 15-part key!!

(We do actually make the IDENTITY the PKey because it is easier for generic code to figure out which column is the IDENTITY and we are lazy :slight_smile: - that has nothing to do with the Clustered Index, of course.)

Pretty sure somebody on an earlier thread said so :grin: but now suddenly a column named ChangeDateTime magically appears.

OK, what I meant to say was: why not use BINARY_CHECKSUM rather than plain vanilla CHECKSUM, or does that not make any difference? Or maybe even HASHBYTES?
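
For what it's worth, a quick way to see how the three behave on the pair of values from the earlier example - whether BINARY_CHECKSUM also collides on that particular pair is something you'd have to run to find out, while HASHBYTES with a modern algorithm makes accidental collisions vanishingly unlikely, at the cost of being more expensive to compute:

 SELECT  CS1  = CHECKSUM('A352KD')
        ,CS2  = CHECKSUM('A352NT')
        ,BCS1 = BINARY_CHECKSUM('A352KD')
        ,BCS2 = BINARY_CHECKSUM('A352NT')
        ,HB1  = HASHBYTES('SHA2_256', 'A352KD')
        ,HB2  = HASHBYTES('SHA2_256', 'A352NT')
;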
Are you using a sync-tracking log table? We had a table that kept track of syncing our SOLR/Elasticsearch/Couchbase containers.
We had all sorts of buckets of data we needed to push, so our table was as follows - not sure if it helps:

Table SyncTracker
SyncSource          LastSyncDateTime          TargetContainer
Products            2017-02-07 09:05:41.320   SOLR
Customer Dashboard  2017-02-07 08:05:41.320   Elastic Search
Sales               2017-02-07 07:05:41.320   CouchBase
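
If it helps, a minimal sketch of that tracking table - the data types here are assumptions, not necessarily what we actually used:

CREATE TABLE dbo.SyncTracker
(
     SyncSource       varchar(100)  NOT NULL
    ,LastSyncDateTime datetime2(3)  NOT NULL
    ,TargetContainer  varchar(100)  NOT NULL
    ,CONSTRAINT PK_SyncTracker PRIMARY KEY (SyncSource, TargetContainer)
);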

Or you could design this to be more generic, like:
ItemType, ItemPK
Products, 1

This more generic design might require some sort of dynamic query, though.

So next time your update process runs, it looks for any rows in the Products table where ChangeDateTime > '2017-02-07 09:05:41.320':

insert into dbo.TableYourSynchProcessCodeGrabsDataFrom
select PK_ProductID
from dbo.Products p
join dbo.SyncTracker st on p.ChangeDateTime >= st.LastSyncDateTime
where st.SyncSource = 'Products'

This is something like what we did for our environment.


Just need to download that magic-patch for all 3rd party databases :slight_smile:

The problem (false positives) is still there. I expect it's possible to predict the percentage failure rate, but it is still not 100% safe (in terms of being 100% sure all differences have been found). I still need 100% found (for the stuff that we do).

If you miss a row, and it's used in an FKey relationship, what then? :frowning: More pain and suffering ...

Sorry not sure what that is?

Some sort of trigger that stores the changed data in a log?

If I could put a Trigger on the 3rd party DB, then just storing PKey and ChangeDateTime in some other "ThisIsWhatChanged" table would do me, of course.
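
Purely for illustration - and assuming the vendor ever allowed it - that trigger might look something like this, with every table and column name here being hypothetical:

CREATE TRIGGER trg_Customer_TrackChange
ON dbo.Customer
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Record just the key and the time; the sync job reads the real row later
    INSERT INTO dbo.ThisIsWhatChanged (TableName, PKey, ChangeDateTime)
    SELECT 'Customer', x.PKey, SYSDATETIME()
    FROM (SELECT PKey FROM inserted
          UNION
          SELECT PKey FROM deleted) AS x;
END;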

Perhaps I could use some sort of replication to get a "feed" of all changes, but not actually store the full rows in the remote DB - just store the PKey and ChangeDateTime? That might do too (and replication might not be seen as a "database modification" by a 3rd party vendor, in the way that adding a Trigger would be).

Dunno if there is anything else non-invasive that could track changes. TLog Reader maybe?

I just need the stinking PKey (well ... any unique key actually) and the ChangeDateTime.

Yes, that is exactly what I mean!!! The SyncTracker table is just that. Some cron job in the background that runs every x minutes does the magic. I don't think you actually need a trigger; we avoided that in our scenario. Just a polling job.

insert into dbo.TableYourSynchProcessCodeGrabsDataFrom
select PK_ProductID
from dbo.Products p
join dbo.SyncTracker st on p.ChangeDateTime >= st.LastSyncDateTime
where st.SyncSource = 'Products'

Your code that pushes changed data to your ES/Couchbase uses this table to push deltas (inserts, deletes and updates), because ChangeDateTime is now greater than LastSyncDateTime.
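
One step the snippets above gloss over is moving the watermark forward once the push has succeeded - a rough sketch, with the Products/SOLR values just as examples:

DECLARE @ThisRunStartedAt datetime2(3) = SYSDATETIME();

-- ... pull the delta rows and push them to the target container here ...

UPDATE dbo.SyncTracker
SET LastSyncDateTime = @ThisRunStartedAt
WHERE SyncSource = 'Products'
  AND TargetContainer = 'SOLR';

Capturing the timestamp before reading the deltas (rather than after) means rows that change while the job is running get picked up again on the next poll instead of being skipped.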

You absolutely can. Just don't use it as a guarantee that two things are identical. Use it to quickly identify things that aren't and then use a more stringent test to ensure that the items that any form of CHECKSUM or HASHBYTES identified as duplicates actually are duplicates.
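
As a rough sketch of that two-stage approach (the table, key and column names here are purely illustrative):

SELECT a.ID, b.ID
FROM dbo.Items a
JOIN dbo.Items b
  ON  CHECKSUM(a.Col1, a.Col2, a.Col3) = CHECKSUM(b.Col1, b.Col2, b.Col3)  -- cheap pre-filter
  AND a.ID < b.ID
WHERE a.Col1 = b.Col1   -- expensive confirmation against the real columns
  AND a.Col2 = b.Col2
  AND a.Col3 = b.Col3
;

One common refinement is to keep the checksum in a persisted, indexed computed column so the pre-filter stays cheap on large tables; either way, it is the column-by-column comparison that actually makes the decision.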