Hello I'm trying to find driver licence numbers that are the same, this is my query:
SELECT DriversLicenseNumber
FROM customer
WHERE (DriversLicenseNumber LIKE '%[0-9][0-9]%[0-9][0-9][0-9]%[0-9][0-9][0-9][0-9]%')
GROUP BY DriversLicenseNumber
HAVING (COUNT(*) > 1)
ORDER BY DriversLicenseNumber
The issue is if a clerk enters them in a little different it doesn't give back those results. For instance if there were 2 records like so 12-345-6789 and 12.345.6789 it won't show those as matching sense they're not identical. What can I add to that query to make it display close matches like the example given? Any help would be great.
Only thing is I need it to use wildcards still. The 12-345-6789 was just an example. I would need it to look for other "almost" matching items. Another example I would need it to give a result for 12-345-6789 and 12.345.6789, but also 44-555-666 and 44 55 6666. I thought using [0-9][0-9]%[0-9][0-9][0-9]%[0-9][0-9][0-9][0-9] would give me the results I was looking for. I thinks it's the count reference I was using. Is there a better way to show items that are almost the same minus certain characters like (space), (-), or (.)? Those would be the main charaters I see in the drivers license number field.
Here's an alternative (a bit more complicated, but it "focuses" on numeric values)
with cte1(n)
as (select row_number() over(order by (select null))
from (values(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)
,(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)
) as cte(n)
)
,cte2(driverslicensenumber)
as (select distinct
driverslicensenumber
from customer
where driverslicensenumber like '%[0-9][0-9]%[0-9][0-9][0-9]%[0-9][0-9][0-9][0-9]%'
)
,cte3(driverslicensenumber,just_num)
as (select driverslicensenumber
,sum(just_num) as just_num
from (select a.driverslicensenumber
,cast(substring(a.driverslicensenumber,b.n,1) as bigint)
*power(10,row_number() over(partition by a.driverslicensenumber
order by b.n desc
)
-1
)
as just_num
from cte2 as a
cross apply cte1 as b
where len(a.driverslicensenumber)>=b.n
and substring(a.driverslicensenumber,b.n,1) like '[0-9]'
) as a
group by driverslicensenumber
)
select top(1) with ties
b.driverslicensenumber
,a.just_num
from cte3 as a
inner join customer as b
on b.driverslicensenumber=a.driverslicensenumber
order by sign(1-count(*) over(partition by a.just_num))
;
Ideally, you'd change the code that captures the numbers so only valid ones are written to the db. Otherwise, this is a hard problem. What about embedded spaces or tabs? What about underscores or other special characters? Is 12..345.6789 also a duplicate? What about letters mixed in? is l23456789 a match? (begins with the letter "l" not a one.) There's no one-size fits all. This is really about fuzzy matching. There are some good tools in SSIS to do that.
If it were me I wouldn't store the separator characters, just the numbers and use a computed column to show the formatted string. Then you could easily create a check constraint on the DriversLicenseNumber column using IsNumeric(DriversLicenseNumber)=1
OK got it to work using some of the above. I tried this and it returns the results I would need to get:
SELECT REPLACE(REPLACE(REPLACE(DriversLicenseNumber, '-', ''), ' ', ''), '.', '') AS Expr1
FROM customer
WHERE (NOT (DriversLicenseNumber IS NULL)) AND (NOT (DriversLicenseNumber = ''))
GROUP BY REPLACE(REPLACE(REPLACE(DriversLicenseNumber, '-', ''), ' ', ''), '.', '')
HAVING (COUNT(*) > 1)
ORDER BY Expr1
i have tried a different approach to this
i have used recursive CTE with PATINDEX
and row number
Seniors "bitsmed and Scott Pletcher"
have mentioned performance issues
with recursive CTE'S
I found this code google searching on the internet
the benefits of this approach
i think is
it looks only for numbers
... god knows how many special cases and characters
.. are there if you are not looking for numbers
code i found on internet
DECLARE @var VARCHAR(100) = '386(+224)74769111908'
WHILE Patindex('%[^0-9]%', @var) <> 0
SET @var = Stuff(@var, Patindex('%[^0-9]%', @var), 1, '')
SELECT @var
this works for a single variable
i was trying to see if i could use the same logic for
a table column
i thought of cross apply
but was not able to do it
i then tried recursive CTE
was able to get it to work
In real time scenarios
will huge millions amount of data
this approach is not tested
might not work
Having said that
drop create sample data with 4 rows
use tempdb
go
drop TABLE #T
create table #T
(
LicenseNo varchar(100)
)
go
INSERT INTO #T SELECT '12-345-678-9'
INSERT INTO #T SELECT '12.345.678-9'
INSERT INTO #T SELECT '12*345*678-9'
INSERT INTO #T SELECT '12 345 678-9'
GO
SQL with rec CTE .. PATINDEX and row number
;WITH cte
AS (SELECT Row_number()
OVER(
ORDER BY licenseno) AS rn,
licenseno
FROM #t),
rec_cte
AS (SELECT rn,
Stuff(cte.licenseno, Patindex('%[^0-9]%', cte.licenseno), 1, '')
AS
abc
FROM cte
UNION ALL
SELECT rn,
Stuff(rec_cte.abc, Patindex('%[^0-9]%', rec_cte.abc), 1, '')
FROM rec_cte
WHERE Patindex('%[^0-9]%', rec_cte.abc) <> 0),
rn_cte
AS (SELECT Row_number()
OVER (
ORDER BY Len(abc)) AS rn,
abc
FROM rec_cte)
SELECT *
FROM rn_cte
WHERE rn <= (SELECT Count(*)
FROM #t)
go