Duplicates two columns and two rows

Hi - can you help me, please?
How do I select or delete duplicate rows?

From this example:

CREATE TABLE #TEST (ID INT NOT NULL, VAL_A VARCHAR(10) NOT NULL, VAL_B VARCHAR(10) NOT NULL)
-- DROP TABLE #TEST
INSERT INTO #TEST (ID, VAL_A, VAL_B)
VALUES(1, 1,2),(2, 2,1), (3,6,5),(4,5,6)

SELECT * FROM #TEST


The best way is to prevent dups from being INSERTed, like so:

CREATE TABLE #TEST ( ID INT NOT NULL, VAL_A VARCHAR(10) NOT NULL,
   VAL_B VARCHAR(10) NOT NULL, CHECK(VAL_A < VAL_B) ) --<<-- values must be in ( low, high ) order

INSERT INTO #TEST (ID, VAL_A, VAL_B)
VALUES(1, 1,2),(4,5,6) --succeeds: both pairs are already in ( low, high ) order
INSERT INTO #TEST (ID, VAL_A, VAL_B)
VALUES(2, 2,1) --fails the CHECK constraint: pair is reversed
GO
INSERT INTO #TEST (ID, VAL_A, VAL_B)
VALUES(3,6,5) --fails the CHECK constraint: pair is reversed

@Barkas ,

Heh... too funny how life sometimes works. 24 years ago, the very first question I had on my very first project was nearly identical. I couldn't get an answer from the instructor that I was taking my one and only SQL training course from and the forum I found way back then (Belution.com... long dead now) also had a lot of similar questions. I even gave MS a call... they all gave me the same technically correct answer of "If you had done things the right way, you wouldn't have any dupes".

The problem was that I didn't create the data... it came in a file and I was the one that had to figure out what the dupes were and delete all but one row of each set of duplicated rows, so their well-intended, technically correct answer was totally useless to me.

Your problem specifically says "Duplicates two columns and two rows" implying that the data will only ever have two rows for dupes and that there will ALWAYS be two dupes. You shouldn't actually count on that. There could be more than two dupes and you could end up with rows with no dupes.

And, you don't want a whole batch of rows to get rejected just because one row fails some check constraint.

With that in mind, you need to make your code a whole lot more bulletproof so that if/when that changes someday, your code will auto-magically handle it all without missing a beat.

With such anomalies in mind, let's add some rows to your original test data (and thank you for posting that!)...

   DROP TABLE IF EXISTS #TEST;
 CREATE TABLE #TEST
        (
         ID     INT         NOT NULL
        ,VAL_A  VARCHAR(10) NOT NULL
        ,VAL_B  VARCHAR(10) NOT NULL
        )
;
 INSERT INTO #TEST
        (ID, VAL_A, VAL_B)
 VALUES  (1,1,2),(2,2,1),(3,6,5),(4,5,6) --Original Test Data
        ,(5,7,8),(6,8,7),(7,7,8)         --Triplicate
        ,(8,9,1)                         --No Dupe
;
 SELECT * FROM #TEST
; 

Very fortunately, the code that does the bullet-proofing is the same code that solves your immediate problem with its current, limited requirements. Here's the code to solve your problem and future problems. It DOES assume that you want to keep the one duplicate row that has the lowest ID. You can certainly and very easily flip that around to keep the one duplicate row that has the highest ID (there's a quick sketch of that at the end of this post), but I'm sure you'll have no problem figuring that out if that's what you need to do.

   WITH cteSortVal AS
(
 SELECT  ID
        ,VAL_A
        ,VAL_B
        ,Dupe# = ROW_NUMBER() OVER (PARTITION BY IIF(VAL_A <= VAL_B, VAL_A, VAL_B)
                                               , IIF(VAL_A <= VAL_B, VAL_B, VAL_A)
                                        ORDER BY ID)
   FROM #Test
)
 DELETE FROM cteSortVal
  WHERE Dupe# > 1
;
 SELECT * 
   FROM #Test
;

Also notice the trick of deleting from the original CTE, which deletes the rows from the underlying table.
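
As mentioned above, the only change needed to keep the highest ID of each set of dupes instead of the lowest is the sort direction in the ROW_NUMBER(). Just as a sketch:

   WITH cteSortVal AS
(
 SELECT  ID
        ,VAL_A
        ,VAL_B
        ,Dupe# = ROW_NUMBER() OVER (PARTITION BY IIF(VAL_A <= VAL_B, VAL_A, VAL_B)
                                               , IIF(VAL_A <= VAL_B, VAL_B, VAL_A)
                                        ORDER BY ID DESC) --DESC keeps the highest ID
   FROM #Test
)
 DELETE FROM cteSortVal
  WHERE Dupe# > 1
;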


They could count on that if the original table had a UNIQUE constraint on ( VAL_A, VAL_B ). Maybe not likely, but definitely possible.

I still say it's better to make sure the values are ( low, high ) than to constantly DELETE dups.

Next best would be a trigger that didn't INSERT the dups, i.e. an INSTEAD OF INSERT trigger, but those can sometimes be work to maintain. Still, I'd do that in preference to constantly deleting dups.
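
Roughly along these lines, for example (the table and trigger names here are just made up for illustration; the trigger puts each pair into ( low, high ) order and silently skips any dupes, both within the incoming batch and against rows already in the table):

CREATE TABLE dbo.pair_list (
    ID int NOT NULL, 
    VAL_A varchar(10) NOT NULL,
    VAL_B varchar(10) NOT NULL,
    CHECK(VAL_A < VAL_B) --<<-- values always stored in ( low, high ) order
    )
GO
CREATE TRIGGER dbo.pair_list__trg_ins
ON dbo.pair_list
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;
    WITH ins AS (
        --put each incoming pair into ( low, high ) order and number any dups within the batch itself
        SELECT ID,
               IIF(VAL_A <= VAL_B, VAL_A, VAL_B) AS LO_VAL,
               IIF(VAL_A <= VAL_B, VAL_B, VAL_A) AS HI_VAL,
               ROW_NUMBER() OVER (PARTITION BY IIF(VAL_A <= VAL_B, VAL_A, VAL_B),
                                               IIF(VAL_A <= VAL_B, VAL_B, VAL_A)
                                      ORDER BY ID) AS dup_num
          FROM inserted
    )
    INSERT INTO dbo.pair_list (ID, VAL_A, VAL_B)
    SELECT i.ID, i.LO_VAL, i.HI_VAL
      FROM ins i
     WHERE i.dup_num = 1 --skip dups within the incoming batch
       AND NOT EXISTS (SELECT 1 FROM dbo.pair_list pl --skip dups already in the table
                        WHERE pl.VAL_A = i.LO_VAL AND pl.VAL_B = i.HI_VAL);
END;
GO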

If a dupe is considered to be a dupe even when the values of the two columns are reversed, then a simple UNIQUE constraint isn't going to do it.

I agree that it would be nice to have the values in order (low, high) but you can't actually count on the provider following that rule. You also don't want to lose a singleton row just because its values are in high/low order.

If someone wants to, they could load the data into a staging table and then use the same formulas I used in the ROW_NUMBER() function to put the two values in low/high order before inserting them into the final table, which would guarantee the order of the data meets the constraint, AND they could also create a rejection index using the IGNORE_DUP_KEY option.
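
Just as a rough sketch of that idea (the table and index names here are made up):

--The final table with the "rejection index"... IGNORE_DUP_KEY quietly discards duplicate
--(VAL_A, VAL_B) pairs with a warning instead of failing the whole INSERT.
 CREATE TABLE dbo.FinalTable
        (
         ID     INT         NOT NULL
        ,VAL_A  VARCHAR(10) NOT NULL
        ,VAL_B  VARCHAR(10) NOT NULL
        )
;
 CREATE UNIQUE INDEX UQ_FinalTable_Vals
     ON dbo.FinalTable (VAL_A, VAL_B)
   WITH (IGNORE_DUP_KEY = ON)
;
--Load the raw file into a staging table first...
 CREATE TABLE #Staging
        (
         ID     INT         NOT NULL
        ,VAL_A  VARCHAR(10) NOT NULL
        ,VAL_B  VARCHAR(10) NOT NULL
        )
;
--...then use the same formulas to put each pair in low/high order on the way in.
 INSERT INTO dbo.FinalTable
        (ID, VAL_A, VAL_B)
 SELECT  ID
        ,IIF(VAL_A <= VAL_B, VAL_A, VAL_B)
        ,IIF(VAL_A <= VAL_B, VAL_B, VAL_A)
   FROM #Staging
;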

I didn't say it would. I said a constraint of that type would ensure no more than two dups, as the OP stated and you commented on.

You can if you enforce it yourself. You could even do all reads / lookups against that table using a view that flipped the values back to ( high, low ) if you really considered it necessary for some odd reason. That is, you could add a bit flag that "told" you if the values originally came in ( high, low ) before you flipped them on INSERT, then you could flip them back on SELECT. Again, if for some reason you really needed to do that vs. just always using ( low, high ) as the values.
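
If you did need that for some reason, it could look something like this (the table, column, and view names here are made up for illustration):

CREATE TABLE dbo.pair_store (
    ID int NOT NULL, 
    VAL_A varchar(10) NOT NULL, --always the lower value
    VAL_B varchar(10) NOT NULL, --always the higher value
    was_flipped bit NOT NULL DEFAULT 0 --1 = the pair originally arrived as ( high, low )
    )
GO
CREATE VIEW dbo.pair_store_as_received
AS
SELECT ID,
       CASE WHEN was_flipped = 1 THEN VAL_B ELSE VAL_A END AS VAL_A, --flip back on SELECT
       CASE WHEN was_flipped = 1 THEN VAL_A ELSE VAL_B END AS VAL_B
  FROM dbo.pair_store;
GO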

Yup. You could do all of that... or just use the code I wrote. :smiley:

Why not store them in order instead of forever having to sort them and aggregate them just to see which is lower? What a waste of resources. Much easier to just ensure, with a trigger, that the values go in ( smaller, larger ) order.

I agree... and in one of the posts above, I explained how the code I posted could be easily modified to do so. The problem is that of adding new data. You still need to do a possible swap of values and you still need to do a dupe check. Depending on requirements, you also may not want a multi-row insert to cause a rollback because of a single dupe or an out-of-order pair of values. To avoid that, the trigger would need to be an INSTEAD OF INSERT trigger to guarantee that the incoming rows have left-to-right ascending values, and you'd still need to do a dupe check or use a UNIQUE constraint/index with IGNORE_DUP_KEY.

If all of that isn't tolerable because a requirement exists that states a multi-row INSERT must fail entirely until the data is fixed, then you're still going to have to do all that which you currently rail against.

As with all else in SQL Server, "It Depends" and, as in many cases, it depends on what the actual requirements are, which have not been clearly stated except to find dupes. Just adding a constraint on a table isn't going to meet that requirement and a simple check constraint for the sort order of the two columns will do nothing to prevent dupes by itself.

Again, I never said it would. But a check constraint for order and a unique constraint, as I think I clearly implied if not directly stated, would indeed prevent dupes and avoid having to write any other code.

Finally, if you needed the good INSERTs to go through and not the bad ones, then, as I stated earlier, you'd need an INSTEAD OF INSERT trigger. The advantage of that is you could then drop the low/high constraint and instead flip the low and high values in that trigger if needed. Also, you could log the bad / unloaded rows to a separate table if desired.

What I'm getting at is that it won't allow for non-duplicated entries that don't happen to be in the correct order, which you have to expect from outside sources.

I don't follow. If the trigger puts the values in order and checks for dups, I don't see what could go wrong.

What do you know about the input data once the trigger is done? The answer is... nothing. I agree that it has cleaned up the data and I agree that the constraints make sure that no anomalies appear in the table and I agree that it will run forever without lifting an additional finger. But... there's no measure of the quality of the data. You don't know if it's getting better or worse. You don't know if there's singleton data, dupe data, triplicate data. You don't know if the data source is getting better or worse nor even if you should trust it in the future.

I know that it's incredibly difficult to see the value in knowing such things and it seems "far fetched" to those who haven't experienced future problems with data that could have been caught if they knew the condition of the original source data. And with that, I'll stop pushing on this problem, but I'd strongly recommend that, at the very least, the input trigger write the conditions of how it finds the original data to a summary table with a tally of what occurred in each batch. After that, it's OK to artificially change the order of the likes of singleton rows so that they fit in the table.
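
As a sketch of what I mean (this assumes the raw file has been loaded into a #Staging table with the same three columns; the audit table name is made up):

 CREATE TABLE dbo.LoadAudit
        (
         AuditDT    DATETIME NOT NULL DEFAULT GETDATE()
        ,Copies     INT      NOT NULL --1 = singleton, 2 = dupe, 3 = triplicate, etc.
        ,PairCount  INT      NOT NULL --how many distinct value pairs arrived with that many copies
        )
;
   WITH ctePairTally AS
(--Count how many times each pair appears in the incoming batch, regardless of value order.
 SELECT  Copies = COUNT(*)
   FROM #Staging
  GROUP BY IIF(VAL_A <= VAL_B, VAL_A, VAL_B)
          ,IIF(VAL_A <= VAL_B, VAL_B, VAL_A)
)
 INSERT INTO dbo.LoadAudit
        (Copies, PairCount)
 SELECT  Copies
        ,PairCount = COUNT(*)
   FROM ctePairTally
  GROUP BY Copies
;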

Two examples I've run into in the past where the data provider had problems that they never checked on...

  1. All the data (for 10's of thousands of rows) was identical because they made an untested change and never checked the output.
  2. The second problem was similar. Although the data varied and only had the expected duplicate rows, none of the data ever changed in the file after a certain point in time. It turned out that the vendor had set up a test to use the same file over and over again while testing and forgot to remove that nuance from the code.

You can't just warp the data to fit the table... you have to know the condition(s) of the data before you warp it and you have to know it all the time.

There are many more examples but the two I provided should be concern enough for what I'm saying. That's what could, has, and will go wrong again.

And, yeah... I know the OP didn't ask for all that... the bad part is, they never do. And, yes... I didn't provide that info either when I first answered but the more we got into it, the more I thought about what has happened with such things in the past.

That's why the table needs the two constraints I mentioned (the UNIQUE clustered index being considered a constraint). I think that prevents all bad data from getting in. And the dup key errors when trying to load new data would quickly point you back to bad source data.

CREATE TABLE dbo.table_name (
    ID int NOT NULL, 
    VAL_A varchar(10) NOT NULL,
    VAL_B varchar(10) NOT NULL,
    CHECK(VAL_A < VAL_B) --<<--
    ) 
CREATE UNIQUE CLUSTERED INDEX table_name__CL ON dbo.table_name ( VAL_A, VAL_B )
    WITH ( FILLFACTOR = 98 /*or whatever*/, .... ) 
    ON [PRIMARY] /*or other filegroup*/;
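
For example, with both of those in place, bad source data surfaces right away:

INSERT INTO dbo.table_name (ID, VAL_A, VAL_B) VALUES (1, '1', '2'); --succeeds
INSERT INTO dbo.table_name (ID, VAL_A, VAL_B) VALUES (2, '2', '1'); --fails the CHECK: not in ( low, high ) order
INSERT INTO dbo.table_name (ID, VAL_A, VAL_B) VALUES (3, '1', '2'); --fails the unique index: duplicate pair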