SQL query to find duplicates that are alike?

taunt · April 24, 2018, 10:50pm

Hello I'm trying to find driver licence numbers that are the same, this is my query:

SELECT DriversLicenseNumber
FROM customer
WHERE (DriversLicenseNumber LIKE '%[0-9][0-9]%[0-9][0-9][0-9]%[0-9][0-9][0-9][0-9]%')
GROUP BY DriversLicenseNumber
HAVING (COUNT(*) > 1)
ORDER BY DriversLicenseNumber

The issue is if a clerk enters them in a little different it doesn't give back those results. For instance if there were 2 records like so 12-345-6789 and 12.345.6789 it won't show those as matching sense they're not identical. What can I add to that query to make it display close matches like the example given? Any help would be great.

Thanks!

jotorre_riversidedpss.org · April 24, 2018, 11:58pm

DECLARE @t table(LicNo varchar(20));

INSERT @t 
    ( LicNo )
VALUES
    ( '12-345-6789' )
  , ( '12.345.6789' );

SELECT t.LicNo
FROM  @t t
WHERE Replace(Replace(t.LicNo, '-', ''), '.', '') = '123456789';

Muj9 · April 25, 2018, 11:28am

as @jotorre_riversidedpss.org stated you can use his version (method 1) or mine (method 2)

DECLARE @t table(LicNo varchar(20));

INSERT @t
( LicNo )
VALUES
( '12-345-6789' )
, ( '12.345.6789' );

--method 1

SELECT
ROW_NUMBER() over (partition by Replace(Replace(t.LicNo, '-', ''), '.', '') order by (select 0)) rn
,t.LicNo
FROM @t t

--method 2

SELECT
stuff(stuff(LicNo,3,1,''),6,1,'') as LicNo_amended
,t.LicNo
FROM @t t

taunt · April 25, 2018, 11:59am

Only thing is I need it to use wildcards still. The 12-345-6789 was just an example. I would need it to look for other "almost" matching items. Another example I would need it to give a result for 12-345-6789 and 12.345.6789, but also 44-555-666 and 44 55 6666. I thought using [0-9][0-9]%[0-9][0-9][0-9]%[0-9][0-9][0-9][0-9] would give me the results I was looking for. I thinks it's the count reference I was using. Is there a better way to show items that are almost the same minus certain characters like (space), (-), or (.)? Those would be the main charaters I see in the drivers license number field.

jotorre_riversidedpss.org · April 25, 2018, 3:28pm

DECLARE @t table(LicNo varchar(20));
INSERT @t ( LicNo )
VALUES
    ( '12-345-6789' )
  , ( '12.345.6789' )
  , ( '12*345*6789' )
  , ( '12 345 6789' );
SELECT 
   t.LicNo
FROM
   @t t
WHERE
   PatIndex('%[0-9][0-9]_[0-9][0-9][0-9]_[0-9][0-9][0-9][0-9]%',t.LicNo)=1;

jotorre_riversidedpss.org · April 25, 2018, 3:39pm

Or to enumerate duplicate ignoring the seperators

CREATE TABLE #t (LicNo varchar(20));
INSERT #t ( LicNo )
VALUES
    ( '12-345-6789' )
  , ( '12.345.6789' )
  , ( '12*345*6789' )
  , ( '12 345 6789' )
  , ( '13.819.5473' )
  , ( '13*819*5473' )
  , ( '13 819 5473' )
  ;
WITH dup(val, instance)
AS
(SELECT 
   LicNo
 , Row_Number() OVER (PARTITION BY Substring(LicNo, 1, 2), Substring(LicNo, 4, 3), Substring(LicNo, 8, 4) ORDER BY LicNo)
FROM
   #t
)
SELECT dup.val
     , dup.instance
FROM dup;
DROP TABLE #t;

bitsmed · April 25, 2018, 7:06pm

Here's an alternative (a bit more complicated, but it "focuses" on numeric values)

with cte1(n)
  as (select row_number() over(order by (select null))
        from (values(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)
                   ,(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)
             ) as cte(n)
     )
    ,cte2(driverslicensenumber)
  as (select distinct
             driverslicensenumber
        from customer
       where driverslicensenumber like '%[0-9][0-9]%[0-9][0-9][0-9]%[0-9][0-9][0-9][0-9]%'
     )
    ,cte3(driverslicensenumber,just_num)
  as (select driverslicensenumber
            ,sum(just_num) as just_num
        from (select a.driverslicensenumber
                    ,cast(substring(a.driverslicensenumber,b.n,1) as bigint)
                    *power(10,row_number() over(partition by a.driverslicensenumber
                                                    order by b.n desc
                                               )
                             -1
                          )
                     as just_num
                from cte2 as a
                     cross apply cte1 as b
               where len(a.driverslicensenumber)>=b.n
                 and substring(a.driverslicensenumber,b.n,1) like '[0-9]'
             ) as a
       group by driverslicensenumber
     )
select top(1) with ties
       b.driverslicensenumber
      ,a.just_num
  from cte3 as a
       inner join customer as b
	           on b.driverslicensenumber=a.driverslicensenumber
 order by sign(1-count(*) over(partition by a.just_num))
;

gbritton · April 25, 2018, 8:19pm

Ideally, you'd change the code that captures the numbers so only valid ones are written to the db. Otherwise, this is a hard problem. What about embedded spaces or tabs? What about underscores or other special characters? Is 12..345.6789 also a duplicate? What about letters mixed in? is l23456789 a match? (begins with the letter "l" not a one.) There's no one-size fits all. This is really about fuzzy matching. There are some good tools in SSIS to do that.

jotorre_riversidedpss.org · April 25, 2018, 9:38pm

If it were me I wouldn't store the separator characters, just the numbers and use a computed column to show the formatted string. Then you could easily create a check constraint on the DriversLicenseNumber column using IsNumeric(DriversLicenseNumber)=1

taunt · April 25, 2018, 9:57pm

OK got it to work using some of the above. I tried this and it returns the results I would need to get:

SELECT REPLACE(REPLACE(REPLACE(DriversLicenseNumber, '-', ''), ' ', ''), '.', '') AS Expr1
FROM customer
WHERE (NOT (DriversLicenseNumber IS NULL)) AND (NOT (DriversLicenseNumber = ''))
GROUP BY REPLACE(REPLACE(REPLACE(DriversLicenseNumber, '-', ''), ' ', ''), '.', '')
HAVING (COUNT(*) > 1)
ORDER BY Expr1

Thanks for your help.

harishgg1 · November 25, 2018, 1:08pm

hi

this topic was a very long time ago

i have tried a different approach to this
i have used recursive CTE with PATINDEX
and row number

Seniors "bitsmed and Scott Pletcher"
have mentioned performance issues
with recursive CTE'S

I found this code google searching on the internet

the benefits of this approach
i think is
it looks only for numbers
... god knows how many special cases and characters
.. are there if you are not looking for numbers

code i found on internet

DECLARE @var VARCHAR(100) = '386(+224)74769111908' 

    WHILE Patindex('%[^0-9]%', @var) <> 0 
      SET @var = Stuff(@var, Patindex('%[^0-9]%', @var), 1, '') 

SELECT @var

this works for a single variable

i was trying to see if i could use the same logic for
a table column

i thought of cross apply
but was not able to do it

i then tried recursive CTE
was able to get it to work

In real time scenarios
will huge millions amount of data
this approach is not tested
might not work

Having said that

drop create sample data with 4 rows

use  tempdb 
go 

drop TABLE #T

create table #T
(
LicenseNo varchar(100)
)
go 

INSERT INTO #T SELECT '12-345-678-9' 
INSERT INTO #T SELECT '12.345.678-9' 
INSERT INTO #T SELECT '12*345*678-9' 
INSERT INTO #T SELECT '12 345 678-9' 
GO

SQL with rec CTE .. PATINDEX and row number

;WITH cte 
     AS (SELECT Row_number() 
                  OVER( 
                    ORDER BY licenseno) AS rn, 
                licenseno 
         FROM   #t), 
     rec_cte 
     AS (SELECT rn, 
                Stuff(cte.licenseno, Patindex('%[^0-9]%', cte.licenseno), 1, '') 
                AS 
                abc 
         FROM   cte 
         UNION ALL 
         SELECT rn, 
                Stuff(rec_cte.abc, Patindex('%[^0-9]%', rec_cte.abc), 1, '') 
         FROM   rec_cte 
         WHERE  Patindex('%[^0-9]%', rec_cte.abc) <> 0), 
     rn_cte 
     AS (SELECT Row_number() 
                  OVER ( 
                    ORDER BY Len(abc)) AS rn, 
                abc 
         FROM   rec_cte) 
SELECT * 
FROM   rn_cte 
WHERE  rn <= (SELECT Count(*) 
              FROM   #t) 

go

Result