SQL query to find duplicates that are alike?

Hello I'm trying to find driver licence numbers that are the same, this is my query:

SELECT DriversLicenseNumber
FROM customer
WHERE (DriversLicenseNumber LIKE '%[0-9][0-9]%[0-9][0-9][0-9]%[0-9][0-9][0-9][0-9]%')
GROUP BY DriversLicenseNumber
HAVING (COUNT(*) > 1)
ORDER BY DriversLicenseNumber

The issue is if a clerk enters them in a little different it doesn't give back those results. For instance if there were 2 records like so 12-345-6789 and 12.345.6789 it won't show those as matching sense they're not identical. What can I add to that query to make it display close matches like the example given? Any help would be great.

Thanks!

DECLARE @t table(LicNo varchar(20));

INSERT @t 
    ( LicNo )
VALUES
    ( '12-345-6789' )
  , ( '12.345.6789' );

SELECT t.LicNo
FROM  @t t
WHERE Replace(Replace(t.LicNo, '-', ''), '.', '') = '123456789';

image

as @jotorre_riversidedpss.org stated you can use his version (method 1) or mine (method 2)

DECLARE @t table(LicNo varchar(20));

INSERT @t
( LicNo )
VALUES
( '12-345-6789' )
, ( '12.345.6789' );

--method 1

SELECT
ROW_NUMBER() over (partition by Replace(Replace(t.LicNo, '-', ''), '.', '') order by (select 0)) rn
,t.LicNo
FROM @t t

--method 2

SELECT
stuff(stuff(LicNo,3,1,''),6,1,'') as LicNo_amended
,t.LicNo
FROM @t t

Only thing is I need it to use wildcards still. The 12-345-6789 was just an example. I would need it to look for other "almost" matching items. Another example I would need it to give a result for 12-345-6789 and 12.345.6789, but also 44-555-666 and 44 55 6666. I thought using [0-9][0-9]%[0-9][0-9][0-9]%[0-9][0-9][0-9][0-9] would give me the results I was looking for. I thinks it's the count reference I was using. Is there a better way to show items that are almost the same minus certain characters like (space), (-), or (.)? Those would be the main charaters I see in the drivers license number field.

DECLARE @t table(LicNo varchar(20));
INSERT @t ( LicNo )
VALUES
    ( '12-345-6789' )
  , ( '12.345.6789' )
  , ( '12*345*6789' )
  , ( '12 345 6789' );
SELECT 
   t.LicNo
FROM
   @t t
WHERE
   PatIndex('%[0-9][0-9]_[0-9][0-9][0-9]_[0-9][0-9][0-9][0-9]%',t.LicNo)=1;

image

1 Like

Or to enumerate duplicate ignoring the seperators

CREATE TABLE #t (LicNo varchar(20));
INSERT #t ( LicNo )
VALUES
    ( '12-345-6789' )
  , ( '12.345.6789' )
  , ( '12*345*6789' )
  , ( '12 345 6789' )
  , ( '13.819.5473' )
  , ( '13*819*5473' )
  , ( '13 819 5473' )
  ;
WITH dup(val, instance)
AS
(SELECT 
   LicNo
 , Row_Number() OVER (PARTITION BY Substring(LicNo, 1, 2), Substring(LicNo, 4, 3), Substring(LicNo, 8, 4) ORDER BY LicNo)
FROM
   #t
)
SELECT dup.val
     , dup.instance
FROM dup;
DROP TABLE #t;

image

1 Like

Here's an alternative (a bit more complicated, but it "focuses" on numeric values)

with cte1(n)
  as (select row_number() over(order by (select null))
        from (values(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)
                   ,(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)
             ) as cte(n)
     )
    ,cte2(driverslicensenumber)
  as (select distinct
             driverslicensenumber
        from customer
       where driverslicensenumber like '%[0-9][0-9]%[0-9][0-9][0-9]%[0-9][0-9][0-9][0-9]%'
     )
    ,cte3(driverslicensenumber,just_num)
  as (select driverslicensenumber
            ,sum(just_num) as just_num
        from (select a.driverslicensenumber
                    ,cast(substring(a.driverslicensenumber,b.n,1) as bigint)
                    *power(10,row_number() over(partition by a.driverslicensenumber
                                                    order by b.n desc
                                               )
                             -1
                          )
                     as just_num
                from cte2 as a
                     cross apply cte1 as b
               where len(a.driverslicensenumber)>=b.n
                 and substring(a.driverslicensenumber,b.n,1) like '[0-9]'
             ) as a
       group by driverslicensenumber
     )
select top(1) with ties
       b.driverslicensenumber
      ,a.just_num
  from cte3 as a
       inner join customer as b
	           on b.driverslicensenumber=a.driverslicensenumber
 order by sign(1-count(*) over(partition by a.just_num))
;

Ideally, you'd change the code that captures the numbers so only valid ones are written to the db. Otherwise, this is a hard problem. What about embedded spaces or tabs? What about underscores or other special characters? Is 12..345.6789 also a duplicate? What about letters mixed in? is l23456789 a match? (begins with the letter "l" not a one.) There's no one-size fits all. This is really about fuzzy matching. There are some good tools in SSIS to do that.

If it were me I wouldn't store the separator characters, just the numbers and use a computed column to show the formatted string. Then you could easily create a check constraint on the DriversLicenseNumber column using IsNumeric(DriversLicenseNumber)=1

OK got it to work using some of the above. I tried this and it returns the results I would need to get:

SELECT REPLACE(REPLACE(REPLACE(DriversLicenseNumber, '-', ''), ' ', ''), '.', '') AS Expr1
FROM customer
WHERE (NOT (DriversLicenseNumber IS NULL)) AND (NOT (DriversLicenseNumber = ''))
GROUP BY REPLACE(REPLACE(REPLACE(DriversLicenseNumber, '-', ''), ' ', ''), '.', '')
HAVING (COUNT(*) > 1)
ORDER BY Expr1

Thanks for your help.

hi

this topic was a very long time ago

i have tried a different approach to this
i have used recursive CTE with PATINDEX
and row number

Seniors "bitsmed and Scott Pletcher"
have mentioned performance issues
with recursive CTE'S

I found this code google searching on the internet

the benefits of this approach
i think is
it looks only for numbers
... god knows how many special cases and characters
.. are there if you are not looking for numbers
:smile:
:wink::wink:

code i found on internet
DECLARE @var VARCHAR(100) = '386(+224)74769111908' 

    WHILE Patindex('%[^0-9]%', @var) <> 0 
      SET @var = Stuff(@var, Patindex('%[^0-9]%', @var), 1, '') 

SELECT @var

this works for a single variable

i was trying to see if i could use the same logic for
a table column

i thought of cross apply
but was not able to do it

i then tried recursive CTE
was able to get it to work

In real time scenarios
will huge millions amount of data
this approach is not tested
might not work

Having said that

:slight_smile:
:slight_smile:

drop create sample data with 4 rows
use  tempdb 
go 

drop TABLE #T

create table #T
(
LicenseNo varchar(100)
)
go 

INSERT INTO #T SELECT '12-345-678-9' 
INSERT INTO #T SELECT '12.345.678-9' 
INSERT INTO #T SELECT '12*345*678-9' 
INSERT INTO #T SELECT '12 345 678-9' 
GO
SQL with rec CTE .. PATINDEX and row number
;WITH cte 
     AS (SELECT Row_number() 
                  OVER( 
                    ORDER BY licenseno) AS rn, 
                licenseno 
         FROM   #t), 
     rec_cte 
     AS (SELECT rn, 
                Stuff(cte.licenseno, Patindex('%[^0-9]%', cte.licenseno), 1, '') 
                AS 
                abc 
         FROM   cte 
         UNION ALL 
         SELECT rn, 
                Stuff(rec_cte.abc, Patindex('%[^0-9]%', rec_cte.abc), 1, '') 
         FROM   rec_cte 
         WHERE  Patindex('%[^0-9]%', rec_cte.abc) <> 0), 
     rn_cte 
     AS (SELECT Row_number() 
                  OVER ( 
                    ORDER BY Len(abc)) AS rn, 
                abc 
         FROM   rec_cte) 
SELECT * 
FROM   rn_cte 
WHERE  rn <= (SELECT Count(*) 
              FROM   #t) 

go
Result

image