Identify source of duplication

sqlor · February 8, 2022, 8:49pm

Hello

I have a table with 'duplicates' (i.e. duplicate records) per each Col1 value and I want to check which Cols are causing this 'duplication'.

E.g.
Product1, Price1, Location1,Region1
Product1, Price2, Location1,Region1
Product2, Price1, Location1,Region2
Product2, Price1, Location2,Region2
...

In the above, both the Price and Location are causing 'duplication' but not the Region.

Is there a neat way to identify these Cols?

Thanks!

jeffw8713 · February 8, 2022, 9:55pm

First - these are not duplicates. Duplicates would be the same exact same values in each column. Second - you cannot identify these as duplicate unless you know what constitutes a duplicate row.

In your example, you have determined that price or location being different but everything else being the same would be considered a duplicate. Unless you already know this before you run a query there is no way you can determine that these are actual duplicates.

Expand on your example:

Product1, Price1, Location1,Region1
Product1, Price2, Location2,Region2
Product1, Price3, Location3,Region3
Product2, Price1, Location1,Region1
Product2, Price2, Location2,Region2
Product2, Price3, Location3,Region3

Are these duplicates because they have different values for price, location and region? Or is the combination of all 4 columns unique - defined the price for a specific location and region?

sqlor · February 19, 2022, 11:28pm

Thanks, let me explain in another way:
In my original example:
For Product1, there are more than one Price and all other fields are same
For Product2, there are more than one Locations and all other fields are same

So I am looking to easily identify:
for each Product, which other field has more than one values

Is that easily structured? If it requires a three page code, it will be difficult for me to remember it.

jeffw8713 · February 20, 2022, 5:29pm

This doesn't explain anything - if you want to identify 'duplicates' then you have to have a definition of what constitutes a duplicate. Then - a simple query using row_number, dense_rank or rank can be used.

For example:

With dups
As (
Select *, rn = row_number() over(partition by ProductID, Price Order By Location)
From someTable
)
Select *
From someTable   s
Where Exists (Select * From dups d Where d.ProductID = s.ProductID And d.Price = s.Price And rn > 1)

This will provide a list of all products with the same price that have multiple locations.

Why would you need to 'remember' anything - once you have the code written, it is saved in a file and all you have to do is pull up the file and execute the code. And - you add plenty of documentation to that code so you don't have to remember, just review your documentation.

And - for new requests you can start with this saved code and adapt it to the new request, and save that script for later use as needed.