INSERT INTO without duplicates problem

dbay · September 23, 2016, 12:34pm

I am having problems updating a table using INSERT INTO. I have a Temp table (TableA) that I do a bulk upload into. I have a Main table (TableB) that needs to be updated from the Temp table (TableA).

TableA can contain duplicates found in TableB. TableA can also contain duplicates found in itself and I think this is what is causing the issue. The Primary Key for both tables is column1.

I use this method for updating to avoid uploading duplicates into TableB from TableA.

    INSERT INTO [dbo].[TableB] (
        [column1],
        [column2],
        [column3],
        [column4],
        [column5] )
    SELECT
        [column1],
        [column2],
        [column3],
        [column4] + '  ' + [column2] + '  ' + [column3],
        [column5]
    FROM 
        [dbo].[TableA] LEFT JOIN [dbo].[TableB] ON
        [dbo].[TableA].[column1] = [dbo].[TableB].[column1]
    WHERE
        [dbo].[TableB].[column1] Is Null;

I am still getting an error saying the there is a Key Violation for trying to upload duplicates. I assume that this is happening because TableA has multiple duplicates in itself. How would I go about handling this?

JamesK · September 23, 2016, 1:00pm

You could try the following:

INSERT INTO [dbo].[TableB] (
        [column1],
        [column2],
        [column3],
        [column4],
        [column5] )
SELECT DISTINCT
    [column1],
    [column2],
    [column3],
    [column4] + '  ' + [column2] + '  ' + [column3],
    [column5]
FROM 
    [dbo].[TableA] AS a
WHERE NOT EXISTS
(
	SELECT * FROM [dbo].[TableB] AS b
	WHERE b.Column1 = a.Column1
)
AND a.Column1 IS NOT NULL;

dbay · September 23, 2016, 1:09pm

Thank you for your reply JamesK. I have already tried this method which produced the same error result. It performs the exact same function as the example I posted but in a tad faster time frame.

dbay · September 23, 2016, 1:50pm

After further testing, I have figured the problem only occurs when there are duplicate indexes in TableA that are not already in TableB. If TableB contains the index already, you can have multiple of the same index in TableA without any problems.

Here is an example of when it will work without an Index Error. TableB already contains all the indexes that TableA is going to compair to:
TableA
column1
1
2
3
4
4
5
TableB
column1
1
2
3
4
5

Here is an example of when it will not work and throw an Index error.
TableA
column1
1
2
3
4
4
5
TableB
column1
1
2
3
5

JamesK · September 23, 2016, 5:14pm

In your example that fails, if the two rows for which Column1 is 4 do not have the same values in the remaining columns (Column2 through Column5), the DISTINCT key word would still keep the duplicates. Hence the error.

If the values in Column2 through 5 are not the same, then the question you have to answer is, which of those values do you want to pick. One way (and I am not recommending this because that may not be what your business requirements call for) is as follows:

;WITH cte AS
(
	SELECT
		[column1],
		[column2],
		[column3],
		[column4] + '  ' + [column2] + '  ' + [column3]
			AS NewColumn4,
		[column5],
		ROW_NUMBER() OVER(PARTITION BY [column1]
			ORDER BY [column2], [column3] ) AS RN

	FROM 
		[dbo].[TableA] AS a
	WHERE NOT EXISTS
		(
			SELECT * FROM [dbo].[TableB] AS b
			WHERE b.Column1 = a.Column1
		)
	AND a.Column1 IS NOT NULL
)
		
INSERT INTO [dbo].[TableB] (
        [column1],
        [column2],
        [column3],
        [column4],
        [column5] )
SELECT
		[column1],
		[column2],
		[column3],
		NewColumn4,
		[column5]
FROM
	cte
WHERE RN=1;

You can try changing the ORDER BY clause in the ROW_NUMBER function to change which row is picked when there are duplicates in Column1

dbay · September 23, 2016, 5:52pm

Hmmm... I am new to TSQL and will have to study up on what you are actually doing in your example JamesK. Especially the "ROW_NUMBER() OVER(PARTITION BY " part. I understand that every column in a record must match another record in order for SELECT DISTINCT to pull out the duplicates and this is part of the problem. They do not match, that is why I am having problems.

Thinking about what I just wrote and looking at your example, there is obviously something wrong with what I am inserting and using as the KEY. Technically, the column that is not matching and not allowing the DISTINCT statement to pull it out as a duplicate, should be part of the KEY as well and be inserted as another unique record.... I have some rethinking to do.
Thank you JamesK. I think you have helped me see the light.

JamesK · September 23, 2016, 8:46pm

The ROW_NUMBER() function assigns a serial number starting at 1 for each set of unique values in the partition by clause and orders the serial number based on the ORDER BY clause. The MSDN page has a few good examples.

One thing I forgot in my previous post was that you need to add a final WHERE clause in the query. Fixed it now.