Too many rows returned in join statement

ERP · April 18, 2016, 5:42pm

I have two tables a customer and an auditlog. I am joining them on the first 7 characters of customer number.

select a.AuditLogToken,a.Recorddate, a.CreatedBy, c.cust_num, cust_seq,a.messagedesc,a.oldvalue, a.NewValue
from AuditLog a
join custaddr c on Left(a.KeyValue,7) = Left(c.cust_num,7)
where a.MessageDesc like 'cust%' and a.RecordDate > '2016-04-11' and a.NewValue is not null

I need just the AuditLogToken to be DISTINCT. As it is now I am getting 5 or six rows for every AuditLogToken

Any advice would bne greatly appreciated.

scarela · April 18, 2016, 6:56pm

Can you provide us with the script to create the tables and some data sample?

ERP · April 18, 2016, 7:07pm

I am just looking for the correct syntax. When I do the join it creates more rows than desired because there is a sequence column in the custaddr table that creates unique results.

I want the DISTINCT on one column not the others and I don't know how to do that.

JamesK · April 18, 2016, 7:28pm

Most of the time when you see more than expected number of rows, that is because your join conditions are not granular enough. It sounds like that is not the problem in your case.

As you may already know, your query would still return multiple rows if you used the DISTINCT keyword ( as in SELECT DISTINCT a.AuditLogToken,....). This is because a row is considered a duplicate of another if and only if ALL the columns have the same values. Had that not been the case, all you would have to do is to add the DISTINCT keyword as I described.

In your case it sounds like for a given AuditLogToken, you might have more than one value of other columns - for example, a given AuditLogToken might have two rows, each with a different RecordDate. If that is the case, you have to decide which of those rows two values of RecordDate you want to see in your output.

One way to get one row per AuditLogToken is to use the row_number function as shown below:

;WITH cte AS
(
	SELECT  a.AuditLogToken ,
			a.Recorddate ,
			a.CreatedBy ,
			c.cust_num ,
			cust_seq ,
			a.messagedesc ,
			a.oldvalue ,
			a.NewValue,
			ROW_NUMBER() OVER 
			(
				PARTITION BY a.AuditLogToken
				ORDER BY (SELECT NULL)
			) AS RN
	FROM    AuditLog a
			JOIN custaddr c ON LEFT(a.KeyValue, 7) = LEFT(c.cust_num, 7)
	WHERE   a.MessageDesc LIKE 'cust%'
			AND a.RecordDate > '2016-04-11'
			AND a.NewValue IS NOT NULL
)
SELECT * FROM cte WHERE RN = 1;

That just picks one random row from among the duplicate AuditLogTokens. You have some control over which row to pick if you change the "(SELECT NULL)" to list columns you want to order by such that the row you want comes on top.

ERP · April 18, 2016, 7:40pm

I think that gives me exactly what I wanted. The AuditLogToken column is unique within its own table but I would get multiple rows based on the join when all I wanted was the cust_num and cust_sequence from the custaddr table.

I checked this against my query without the join and custaddr columns and they matched up. Thank you for your help!

Kristen · April 19, 2016, 4:00pm

Please pay careful attention to JamesK's statement "That just picks one random row from among the duplicate AuditLogTokens". My recommendation would be that you force a Sort so that, at the least, the order is repeatable - otherwise two separate runs of the query may give you different results. Unfortunately 99% of the time you'll probably get the same result, because the data will usually be cached in the same way, so its a hard bug to resolve [if you don't have a sort to force the query to be repeatable]