T-sql 2012 insert into table without having duplicates

In t-sql 2012, I want unique records (no duplicates) inserted into an empty table called DCert. The column that contains the duplicate values is [State ID].
Here is the sql that I have so far and the duplicate records are appearing on the insert statement.

Note: I do need the following where clause '([State ID] <> (' ')) AND ([State ID] is not null) AND len([State ID]) > 8
due to the rows that contain 'bad data'.

INSERT INTO DCert
(stateID, custnumber, lastname,firstname,mailAddress)
SELECT DISTINCT
DC.[State Id] as stateID
,DC.[ID] as studentnumber
,DC.[Last Name] as lastname
,DC.[First Name] as firstname
,Mailingaddress.mailAddress
FROM test1.dbo.CerImport DC
LEFT JOIN Mailingaddress Mailingaddress ON Mailingaddress.stateID=DC.[State Id]
where

DC.[State ID] IN (
SELECT distinct [State ID]
FROM test1.dbo.CerImport
where ([State ID] <> (' ')) AND ([State ID] is not null)
AND len([State ID]) > 8
)

Thus can you show me the t-sql 2012 so that no duplicate records by [State ID], are not inserted into the DCert table?

For a given [State Id] in test1.dbo.CerImport, WHICH HAS DUPLICATE ROWS on [State Id], are the other columns studentnumber, lastname, firstname, mailAddress ALWAYS the same?

If not using SELECT DISTINCT you will wind up with multiple rows for a given value of [State Id] where the, e.g. lastname is different.

To avoid that instead of using SELECT DISTINCT you could select the "first" value for each [State Id] (ordered by whatever makes the most sense, but if there is no "obvious" preferred value you could just order by studentnumber, lastname, firstname, mailAddress so (if you repeat the import) you get a repeatable ordering to determine the "first row" for each [State Id]

Would that solve your problem?

I would not use the IN clause as you have it, I would just make that into a WHERE on the main SQL statement.

Do you also need to check that the [State Id] being imported from test1.dbo.CerImport does not already exist in the local database DCert.stateID ? Or are the rows in the import guaranteed to not already exist?

That question also applies to "Will you want to "top up" this import with rows which have new [State Id] values in future? If so that could be added to the WHERE clause so that no pre-existing [State Id] that match the local DCert.stateID will be imported.

YES the other values are always the same! This is an empty table right now so there are no duplicates that exist.
Based upon your question, I will also add the columns called importDate so I will only pick records from the most recent load of the table. How would you setup this sql?

This ought to be sufficient (**):

INSERT INTO DCert
(
	stateID, custnumber, lastname, firstname, mailAddress
)
SELECT DISTINCT
	DC.[State Id] as stateID
	, DC.[ID] as studentnumber
	, DC.[Last Name] as lastname
	, DC.[First Name] as firstname
	, Mailingaddress.mailAddress
FROM	test1.dbo.CerImport AS DC
	LEFT JOIN Mailingaddress AS Mailingaddress
		ON Mailingaddress.stateID = DC.[State Id]
WHERE	LEN([State ID]) > 8 
--	AND [State ID] <> ' '		-- Redundant
--	AND [State ID] IS NOT NULL	-- Redundant

but I would check that you have no obvious duplicates first:

SELECT	TOP 100
	DC.[State ID]
	, DC.[ID]
	, COUNT(*) AS [Count]
FROM
(
	SELECT DISTINCT
		DC.[State Id] as stateID
		, DC.[ID] as studentnumber
		, DC.[Last Name] as lastname
		, DC.[First Name] as firstname
		, Mailingaddress.mailAddress
	FROM	test1.dbo.CerImport AS DC
		LEFT JOIN Mailingaddress AS Mailingaddress
			ON Mailingaddress.stateID = DC.[State Id]
	WHERE	LEN([State ID]) > 8 
--		AND [State ID] <> ' '		-- Redundant
--		AND [State ID] IS NOT NULL	-- Redundant
) AS X
GROUP BY DC.[State ID]
	, DC.[ID]
HAVING COUNT(*) > 1

if that returns no rows then you have NO duplicates on DC.[State Id], DC.[ID] but with different names

(**) Just checking: Mailingaddress is on the LOCAL server/database, and NOT in the [test1] database?

1 Like

Is it possible that a student has more than one mailing address? If so - then you need to identify which address you want to keep and how to identify that address in the join.

1 Like