Find the best match

barnabeck · October 4, 2022, 3:23pm

I have 2 tables (ID, Name / both Key Columns) of 2 independent sistems that carry more or less the same information but using another nomenclature for the naming and different Identificators ID.
I need to get the best guess for what ID corresponds to the ID of the master system.

example:
Table1: ID = 100, Name = 'Red Green Purple'
Table2: ID = 200, Name = 'Green Purple Black'
I want that for ID = 100 to be returned ID, 66% as 2 out of the 3 words match that master ID (in case that there isn't a better match)

My approach is:

Split the Name into its components (cross apply String_Split(name,' ')) for Table1
Link the result to Table2: left outer join Table2 b on b.NAME like '%' + a.value + '%')

This gives me basically all the records I need to consider. The rest should be only countig and comparing in order to get the best guess for the correlation between the 2 ID's. Unfortunately I don't see how to do this step. It's a mapping process with not identical names.

Here is some sample data:

IF OBJECT_ID('tempdb..#tbl1') IS NOT NULL
DROP TABLE #tbl1
CREATE TABLE #tbl1 (id int, name nvarchar(30))

INSERT INTO #tbl1 (id, name) Values (100, 'red green purple')
INSERT INTO #tbl1 (id, name) Values (101, 'blue red')
INSERT INTO #tbl1 (id, name) Values (102, 'brown black')
INSERT INTO #tbl1 (id, name) Values (103, 'orange green')

IF OBJECT_ID('tempdb..#tbl2') IS NOT NULL
DROP TABLE #tbl2
CREATE TABLE #tbl2 (id int, name nvarchar(30))

INSERT INTO #tbl2 (id, name) Values (200, 'black green')
INSERT INTO #tbl2 (id, name) Values (201, 'red blue')
INSERT INTO #tbl2 (id, name) Values (202, 'red black')
INSERT INTO #tbl2 (id, name) Values (203, 'orange brown red green')

;with data as (
select id, name, Value
from #tbl1
cross apply String_Split(name,' ')),

match as (
select a.id, a.name, a.Value, b.name as master_name, b.id as master_ID
from data a
left outer join #tbl2 b on b.NAME like '%' + a.value + '%')

select * from match

The result I want for this sample wuld be this: (in case multiple records have the same match quality, I want all of them to be displayed)

ID    Name                Name_Master             ID_Master   Match
------------------------------------------------------------------
100  'Red Green Purple'  'Orange Brown Red Green'  203         66%
101  'Blue Red'          'Red Blue'                201        100%
102  'Brown Black'       'Orange Brown Red Green'  203         33%
102  'Brown Black'       'Black Green'             200         33%
102  'Brown Black'       'Red Black'               202         33%
103  'Orange Green'      'Orange Brown Red Green'  203        100%

ScottPletcher · October 4, 2022, 6:08pm

First see if this works for you, then we can worry about efficiency if needed:


;with data1 as (
select id, name, Value
from #tbl1
cross apply String_Split(name,' ')
),
data2 as (
select id, name, Value
from #tbl2
cross apply String_Split(name,' ')
)
select d1.*, d2.*
from (
    select d1.id as id1, d2.id as id2, 
        row_number() over(partition by d1.id order by count(*) desc) as row_num
    from data1 d1
    inner join data2 d2 on d2.value = d1.value
    group by d1.id, d2.id
) as matches
inner join #tbl1 d1 on d1.id = matches.id1
inner join #tbl2 d2 on d2.id = matches.id2
where row_num = 1

Ifor · October 4, 2022, 6:20pm

I do not understand where 33% comes from.
The following produces 50% instead of 33%. The efficiency can probably be improved.

WITH T1
AS
(
    SELECT T.id, T.[name], X.[value]
    FROM #tbl1 T
        CROSS APPLY STRING_SPLIT([name], ' ') X
)
,T2
AS
(
    SELECT T.id, T.[name], X.[value]
    FROM #tbl2 T
    CROSS APPLY STRING_SPLIT([name], ' ') X
)
,ValueCnts
AS
(
    SELECT T1.id AS id1, T2.id AS id2
        ,COUNT(1) AS ValueCnt
    FROM T1
        JOIN T2
            ON T1.[value] = T2.[value]
    GROUP BY T1.id, T2.id
)
,MaxValueCnts
AS
(
    SELECT id1
        ,MAX(ValueCnt) AS MaxValueCnt
    FROM ValueCnts
    GROUP BY id1
)
SELECT T1.id, T1.[name]
    ,T2.[name] AS Name_Master
    ,T2.id AS ID_Master
    ,V.ValueCnt * 100 / X.T1cnt AS [Percent]
FROM MaxValueCnts M
    JOIN ValueCnts V
        ON M.id1 = V.id1
            AND M.MaxValueCnt = V.ValueCnt
    JOIN #tbl1 T1
        ON V.id1 = T1.id
    JOIN #tbl2 T2
        ON V.id2 = T2.id
    CROSS APPLY
    (
        SELECT COUNT(*)
        FROM STRING_SPLIT(T1.[name], ' ')
    ) X (T1cnt)
ORDER BY id;

ScottPletcher · October 4, 2022, 6:41pm

I just saw the part about listing all matches if they have the same %, meaning I need to use DENSE_RANK rather than ROW_NUMBER:


;with data1 as (
select id, name, len(name) - len(replace(name, ' ', '')) + 1 as value_count, Value
from #tbl1
cross apply String_Split(name,' ')
),
data2 as (
select id, name, len(name) - len(replace(name, ' ', '')) + 1 as value_count, Value
from #tbl2
cross apply String_Split(name,' ')
)
select d1.*, d2.*
from (
    select d1.id as id1, d2.id as id2, 
        dense_rank() over(partition by d1.id order by count(*) desc) as rank_num
    from data1 d1
    inner join data2 d2 on d2.value = d1.value
    group by d1.id, d2.id
) as matches
inner join #tbl1 d1 on d1.id = matches.id1
inner join #tbl2 d2 on d2.id = matches.id2
where rank_num = 1

barnabeck · October 5, 2022, 7:37am

Thank you Ifor! That does exactly what I need and off course you are right about the 33% comment. Great solution!

barnabeck · October 6, 2022, 10:11am

Ifor, I come back to you because I had some problems when adapting your code to my real environment, where the string is not made of colors but carries the customer name that often includes abbreviations with the legal classification of the company (S.L., S.L.U., S.A.). In order to exclude these abbreviations from the matching process I filter them in T1 and T2 by doing

CROSS APPLY STRING_SPLIT(Replace([name],'.',' '), ' ') X
WHERE len(X.[value]) > 3

As a result the calculation of the percentage gets totally wrong afterwards as it refers to the original tables where the abbreviations are still part of the string. In order to replicate this, add some of the abbreviations to the string ('Red Green Purple S.L.U.') and the above code.

Ifor · October 6, 2022, 6:23pm

Did you put the where clause in the last string_split?

    CROSS APPLY
    (
        SELECT COUNT(*)
        FROM STRING_SPLIT(T1.[name], ' ')
        WHERE len([value]) > 3
    ) X (T1cnt)

barnabeck · October 6, 2022, 8:22pm

No, in the first 2 string_splits. Inside the tables you called T1 and T2. That operation eliminates all abbreviations for the 2 main tables I compare