Prevent Duplicate Insert Into Table with 30 Columns

bbaxter · July 26, 2017, 11:44pm

SQL Server 2012
I have a single table that contains 30 columns of statistic data.
The table schema is fixed and cannot be changed.
The data types for the columns are varchar(5) thru varchar(50)

Q1.
How to create an index for a table with more than 16 columns

Q2.
I need to add functionality that will prevent duplicate records from being entered.
A duplicate is defined as a match to an existing record on all 30 columns.
If all 30 values of the new data are not identical when compared to all columns of each existing table record,
it is considered unique and should be inserted into the table.
Since a unique index only allows 16 columns to be included, how can I process or design the database
to check all 30 columns to prevent duplicate records?

jotorre_riversidedpss.org · July 26, 2017, 11:56pm

What if you created an unique indexed view on the table concatenating all the columns as quoted CSV or TSV. The caveate is they can't be over 900 bytes.

Kristen · July 27, 2017, 6:11am

Trigger?

bitsmed · July 27, 2017, 7:15am

Afaik the limit in index are not number of columns - it is size in bytes.
For clustered index the maximum is 900 bytes, for non-clustered it is 1700 bytes.

Would a non-clustered index suffice?

Kristen · July 27, 2017, 8:07am

I've had to work around merging data where Case (in an otherwise case-insensitive database) and Trailing Spaces were considered as "differences" - i.e. important to merge those rows even though SQL thinks they match

Unique Index has some of those issues (which a Trigger, for example, could work around)

CREATE TABLE #TEMP
(
	Col1 varchar(5) NULL
	, Col2 varchar(5) COLLATE Latin1_General_BIN2 NULL
)

CREATE UNIQUE INDEX TEMP_UNIQUE
on #TEMP
(
	Col1
	, Col2
)
GO

PRINT 'Test 1 OK'
GO
INSERT INTO #TEMP VALUES('ABC', 'DEF')
GO
PRINT 'Test 2 - DUP'
GO
INSERT INTO #TEMP VALUES('ABC', 'DEF')
GO
PRINT 'Test 3 - DUP on COL2 Case Difference - OK'
GO
INSERT INTO #TEMP VALUES('ABC', 'def')
GO
PRINT 'Test 4 - DUP on COL1 Case Difference'
GO
INSERT INTO #TEMP VALUES('abc', 'DEF')
GO
PRINT 'Test 5 - DUP on COL1 trailing spaces only'
GO
INSERT INTO #TEMP VALUES('ABC ', 'DEF')
GO
PRINT 'Test 6 - DUP on COL2 trailing spaces only'
GO
INSERT INTO #TEMP VALUES('ABC', 'DEF ')
GO

DROP TABLE #TEMP
GO

ScottPletcher · July 27, 2017, 1:58pm

Q1. You can't have more than 16 key columns in an index in SQL 2012. (SQL 2016 allows up to 32 key cols, I think.)

Q2. You will have to use a trigger. Create a unique index first with the 16 most-unique of the columns first to speed up the match processing, then:

CREATE TRIGGER table1__trg_check_dups
ON dbo.table1
INSTEAD OF INSERT
AS
SET NOCOUNT ON;
IF EXISTS(SELECT 1 FROM dbo.table1 t1 INNER JOIN inserted i ON t1.col1 = i.col1 AND t1.col2 = i.col2 /*...*/ AND t1.col30 = i.col30)
BEGIN
    RAISERROR('Duplicate row(s) would be created on "table1" by this statement.  This is not allowed, so the statement, and the transaction containing it, have been cancelled.', 16, 1)
    ROLLBACK TRANSACTION
    RETURN
END /*IF*/
INSERT INTO dbo.table1
SELECT *
FROM inserted
GO

Kristen · July 27, 2017, 2:12pm

Assuming haven't got to worry about Alpha-Case being different (i.e. Database Collation takes care of that**) and no need to treat trailing spaces as different, then I think for any column which is NULLable it needs

    (t1.[col1] = i.[col1] OR (t1.[col1] IS NULL AND i.[col1] IS NULL))
AND (t2.[col1] ...

** if database collation IS Case Sensitive AND for the purposes of this exercise different case would be considered a duplicate, then that would need taking into consideration (on String columns only)

jotorre_riversidedpss.org · July 27, 2017, 4:28pm

Please post a create table script for this table with insert statement with sample data.

robert_volk · July 28, 2017, 11:45pm

This is not particularly elegant but should do the trick:

ALTER TABLE myTable ADD col_hash AS HASHBYTES('MD5',
ISNULL(CAST(col1 as varbinary(max)),0x00) + ISNULL(CAST(col2 as varbinary(max)),0x00) +
ISNULL(CAST(col3 as varbinary(max)),0x00) + ISNULL(CAST(col4 as varbinary(max)),0x00) +
ISNULL(CAST(col5 as varbinary(max)),0x00) + ISNULL(CAST(col6 as varbinary(max)),0x00) +
ISNULL(CAST(col7 as varbinary(max)),0x00) + ISNULL(CAST(col8 as varbinary(max)),0x00) +
ISNULL(CAST(col9 as varbinary(max)),0x00) + ISNULL(CAST(col10 as varbinary(max)),0x00) +
ISNULL(CAST(col11 as varbinary(max)),0x00) + ISNULL(CAST(col12 as varbinary(max)),0x00) +
ISNULL(CAST(col13 as varbinary(max)),0x00) + ISNULL(CAST(col14 as varbinary(max)),0x00) +
ISNULL(CAST(col15 as varbinary(max)),0x00) + ISNULL(CAST(col16 as varbinary(max)),0x00) +
ISNULL(CAST(col17 as varbinary(max)),0x00) + ISNULL(CAST(col18 as varbinary(max)),0x00) +
ISNULL(CAST(col19 as varbinary(max)),0x00) + ISNULL(CAST(col20 as varbinary(max)),0x00) +
ISNULL(CAST(col21 as varbinary(max)),0x00) + ISNULL(CAST(col22 as varbinary(max)),0x00) +
ISNULL(CAST(col23 as varbinary(max)),0x00) + ISNULL(CAST(col24 as varbinary(max)),0x00) +
ISNULL(CAST(col25 as varbinary(max)),0x00) + ISNULL(CAST(col26 as varbinary(max)),0x00) +
ISNULL(CAST(col27 as varbinary(max)),0x00) + ISNULL(CAST(col28 as varbinary(max)),0x00) +
ISNULL(CAST(col29 as varbinary(max)),0x00) + ISNULL(CAST(col30 as varbinary(max)),0x00));

ALTER TABLE myTable ADD CONSTRAINT UNQ_col_hash UNIQUE(col_hash);

If all of your columns are declared NOT NULL then you can just CAST(colXX as varbinary(max)). The combination of values should hash to a unique MD5 value which is 128 bits. If you are concerned about hash collisions, simply add a 2nd hashed column like this and use SHA or some other supported hash algorithm, and create a unique constraint on both of those columns.

Edit: F------k, I just noticed you said the table schema can't be changed. Let me think a bit.

Edit 2: I set up a test using an indexed view that prevents duplicates:

SET ANSI_NULLS,ANSI_PADDING,ANSI_WARNINGS,ARITHABORT,CONCAT_NULL_YIELDS_NULL,QUOTED_IDENTIFIER ON;
GO
SET NUMERIC_ROUNDABORT OFF;
GO
create table dbo.A(a varchar(5) not null, b varchar(10) not null);
GO
create view dbo.ind(a,b,hash_col) with schemabinding AS
SELECT a, b, HASHBYTES('MD5',cast(a as varbinary(5))+cast(b as varbinary(10)))
from dbo.A;
GO
create unique clustered index ix on dbo.ind(hash_col);

insert dbo.A(a,b) values('a','b');
insert dbo.A(a,b) values('a','b');  -- this insert fails, violates uniqueness in indexed view

I'm not sure if you can cast as varbinary(max) and have it work properly, since you're not using max types it's probably better to cast to fixed binary anyway. Note that my columns are declared NOT NULL, if yours are not you'll need to wrap them in ISNULL() like in my earlier example.

You can read more about indexed views here:

https://docs.microsoft.com/en-us/sql/relational-databases/views/create-indexed-views

Edit 3: One other thing I noticed, since this casts to binary values it is automatically case-sensitive, regardless of which collation you're using. If you want case-insensitive uniqueness you'll have to do CAST(UPPER(col) as varbinary(...)) inside the HASHBYTES() function.

Edit 4: One last thought before I sleep:

create view dbo.ind(a,b,hash_col) with schemabinding AS
SELECT a, b, HASHBYTES('MD5',cast(a as varbinary(5))+0x00+cast(b as varbinary(10))+0x00)
from dbo.A;
GO

This avoids a problem with concatenating 'ab'+'c' vs. 'a'+'bc' being considered dupes. You'd have to concatenate 0x00 between each column in the HASHBYTES() expression.