Replication out of sync, but status is OK/Excellent

We have two databases on different servers (2008 and 2005). Each has a publication and subscription. One publishes 3 tables to the other, and the other publishes 6 tables to the other. The replication monitor says both publications are running. The performance is excellent on both and the latency is 00:00:02 on one and 00:00:04 on the other.

All tables have the same number of rows across databases, but some of the values in the columns are different. I ran a query comparing TableA.col1 to TableB.col3 on one server and it returns zero rows. On the other server, the same query against the same tables returns 15 rows. TableA is published from one server and TableB is published from the other server.

What would cause this and how should I correct it? Is there something I can do that can alert me to this?

Thank you,

Greg

Regarding the alert/monitor for this, Kendra Little discusses a Canary table solution that is interesting and can be used for other technologies too.

http://www.brentozar.com/archive/2014/07/monitoring-sql-server-transactional-replication/

Tara,

Thanks. Interesting article. I think I will set up some canary tables. Do you think if I reinitialized the subscription it would correct the problem I have now?

Greg

Yes it would, however you need to figure out why it happened. I bet replication is throwing an error or is latent and that the monitor is just not showing it. Check the log reader agent's status, verify the distributor, etc.

Yesterday I reinitialized the publications on both servers with both 'Use current snapshot' and then 'use new snapshot', when the first did not clear up the problems. This morning I checked and only 4 of the 13 out of sync rows were cleared up. All publications, subscriptions, and agents are "Excellent", "Running", "OK". All latency readings are between 00:00:02 and 00:00:04.

What log do I looking in to check for errors. I found the page below, but it is for 2016 and the commands don't run on 2008.

https://msdn.microsoft.com/en-us/library/ms146878.aspx

Greg

Insert a tracer token into each of the publications. Does it come back within a few seconds?

P to D 3 seconds; D to S 5 seconds. It was 2 seconds better on the other server.

Are you sure the tables that have this out of sync issue are in the publications. If the tracer tokens came back that fast, then those tables can't be out of sync by more than 5 seconds.

Yes. The publication on server A is just one table. The Publication on server B is 5 tables. The query I run compares the Server A table to a view created from the tables on server B. Server B distributes to server A and server A distributes to server B.

If I run the same query on both servers that compared the table to the view I should get the same results, but I don't.

The table from Server A is out of sync with server B. The query below returns different results. I ran it just now after I reinitialized with a new snap shot yesterday. It is very odd.

SELECT MPI FROM AllAccounts WHERE ID = '06324RR'

I think the issue is with the query you are running or with the view. I don't think you have a replication problem.

I don't how that is possible. I took the view out of the equation. I copy and paste the same query referencing the same table from two different servers and get two different results. One result from the original table and one result from replicated table. The query could not be more simple: SELECT columnA from TableA WHERE IndexID = 244939 (IndexID is an identity field).

All other columns are identical, which is not surprising in this case. The row has both create date and edit date columns and edit date is null. Our normal process is that the out of sync column is routinely updated on the row after it is first created, but that is not considered an edit, so edit date does not get added or updated for changes on this one column. That column can have multiple updates, but they are not that frequent. Edits are on other columns is more rare, but does happen.

I'll admit that it does not seem like a replication problem because the replication seems to be working properly, but what else would explain this. Nine rows - the value in one column in nine rows does not match out of 287,887 rows.

Greg

I'll also add that new rows are added by the minute. In the past few minutes the row count went from 287,887 to 287,891, and both servers show an identical increase in rows. Still, those nine rows don't match.

Greg